MoneyLion Stage 2 Project

Author

Karim Abdul Aziz Chatab

Predicting Loan Repayment

1. Introduction

1.1 Business Context

This project works with three datasets extracted from data.zip. As the (prospective) data scientist at MoneyLion, I aim to assess the loan repayment quality of the customers described in the data.

1.2 Required Libraries

```{python}
import time

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import missingno
from missingno import matrix
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from scipy import stats
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
```

1.3 Dataset Overview

Full details of each dataset are given in the data dictionary. In short:

loan.csv

This file has 19 columns containing the customer's loan information, along with an anonymised SSN and a loan id. Each row represents a loan application.

payment.csv

This file has 9 columns containing the customer's payment information, along with ACH return codes and the loan id.

clarity_underwriting_variables.csv

This file has 54 columns, each a variable from the underwriting report, with one row per report.

Importing the dataset:

```{python}
loan = pd.read_csv('data/loan.csv')
payment = pd.read_csv('data/payment.csv')
clarity = pd.read_csv('data/clarity_underwriting_variables.csv')
```
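The three files can be linked together: payment.csv joins loan.csv on loanId (one loan to many payments), and the printed heads suggest loan's clarityFraudId matches clarity's underwritingid. A minimal merge sketch on synthetic miniature frames (the join keys come from the displayed columns; the tiny data itself is made up for illustration):

```python
import pandas as pd

# Miniature frames mimicking the schema seen in the heads below (made-up rows)
loan = pd.DataFrame({
    "loanId": ["LL-I-1", "LL-I-2"],
    "clarityFraudId": ["abc", "def"],
    "loanAmount": [500.0, 3000.0],
})
payment = pd.DataFrame({
    "loanId": ["LL-I-1", "LL-I-1", "LL-I-2"],
    "paymentAmount": [169.61, 169.61, 250.0],
})
clarity = pd.DataFrame({
    "underwritingid": ["abc", "def"],
    "clearfraudscore": [871.0, 397.0],
})

# payment joins on loanId (one-to-many); clarity joins on
# loan.clarityFraudId == clarity.underwritingid (one-to-one)
merged = (loan
          .merge(payment, on="loanId", how="left")
          .merge(clarity, left_on="clarityFraudId",
                 right_on="underwritingid", how="left"))
print(merged[["loanId", "paymentAmount", "clearfraudscore"]])
```

Left joins keep every loan even when it has no payments, which matters here since withdrawn and rejected applications never generate payment rows.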

2. Data Preprocessing

2.1 Exploratory Data Analysis

```{python}
#display loan
print(loan.head().to_string()) 
```
          loanId                          anon_ssn payFrequency    apr             applicationDate  originated              originatedDate  nPaidOff  approved  isFunded             loanStatus  loanAmount  originallyScheduledPaymentAmount state     leadType  leadCost fpStatus            clarityFraudId  hasCF
0  LL-I-07399092  beff4989be82aab4a5b47679216942fd            B  360.0  2016-02-23T17:29:01.940000       False                         NaN       0.0     False         0  Withdrawn Application       500.0                            978.27    IL  bvMandatory         6      NaN  5669ef78e4b0c9d3936440e6      1
1  LL-I-06644937  464f5d9ae4fa09ece4048d949191865c            B  199.0  2016-01-19T22:07:36.778000        True  2016-01-20T15:49:18.846000       0.0      True         1          Paid Off Loan      3000.0                           6395.19    CA    prescreen         0  Checked  569eb3a3e4b096699f685d64      1
2  LL-I-10707532  3c174ae9e2505a5f9ddbff9843281845            B  590.0  2016-08-01T13:51:14.709000       False                         NaN       0.0     False         0  Withdrawn Application       400.0                           1199.45    MO  bvMandatory         3      NaN  579eab11e4b0d0502870ef2f      1
3  LL-I-02272596  9be6f443bb97db7e95fa0c281d34da91            B  360.0  2015-08-06T23:58:08.880000       False                         NaN       0.0     False         0  Withdrawn Application       500.0                           1074.05    IL  bvMandatory         3      NaN  555b1e95e4b0f6f11b267c18      1
4  LL-I-09542882  63b5494f60b5c19c827c7b068443752c            B  590.0  2016-06-05T22:31:34.304000       False                         NaN       0.0     False         0               Rejected       350.0                            814.37    NV  bvMandatory         3      NaN  5754a91be4b0c6a2bf424772      1
```{python}
#display clarity
print(clarity.head().to_string()) 
```
(head of clarity omitted here: its ~50 dot-prefixed columns wrap unreadably in plain text. They fall into three groups: `.clearfraudinquiry.*` inquiry counts over windows from one minute to 365 days, `.clearfraudindicator.*` boolean fraud flags, and `.clearfraudidentityverification.*` identity-match results, plus the numeric `clearfraudscore` and the `underwritingid` key.)
```{python}
#display payment
print(payment.head().to_string()) 
```
          loanId  installmentIndex  isCollection          paymentDate  principal    fees  paymentAmount paymentStatus paymentReturnCode
0  LL-I-00000021                 1         False  2014-12-19T05:00:00      22.33  147.28         169.61       Checked               NaN
1  LL-I-00000021                 2         False  2015-01-02T05:00:00      26.44  143.17         169.61       Checked               NaN
2  LL-I-00000021                 3         False  2015-01-16T05:00:00      31.30  138.31         169.61       Checked               NaN
3  LL-I-00000021                 4         False  2015-01-30T05:00:00      37.07  132.54         169.61       Checked               NaN
4  LL-I-00000021                 5         False  2015-02-13T05:00:00      43.89  125.72         169.61       Checked               NaN
```{python}
print(f"Payment Rows: {len(payment):,}, Loan Rows: {len(loan):,}")
```
Payment Rows: 689,364, Loan Rows: 577,682
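Payment rows outnumber loan rows because a funded loan generates many installment rows, while withdrawn or rejected applications generate none. A sketch of isolating loans with no payment history, using a made-up miniature of the two tables (only the column names loanId and loanStatus are taken from the real data):

```python
import pandas as pd

# Miniature tables for illustration (rows are made up)
loan = pd.DataFrame({
    "loanId": ["LL-I-1", "LL-I-2", "LL-I-3"],
    "loanStatus": ["Paid Off Loan", "Withdrawn Application", "Rejected"],
})
payment = pd.DataFrame({
    "loanId": ["LL-I-1", "LL-I-1"],
    "paymentAmount": [169.61, 169.61],
})

# Loans whose id never appears in the payment table
no_payments = loan.loc[~loan["loanId"].isin(payment["loanId"])]
print(no_payments["loanStatus"].tolist())
```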

2.1.1 Summary Statistics

Summary statistics for the numerical variables and frequency tables for the categorical variables:
```{python}
# summary table: describe() for numeric columns, value counts for categorical ones
def describe_full(df):
    numeric_stats = df.describe()

    categorical_stats = {col: df[col].value_counts()
                         for col in df.select_dtypes(include=['object', 'bool']).columns}

    print("Numeric Statistics:")
    print(numeric_stats)
    print("\nCategorical Counts:")
    for col, counts in categorical_stats.items():
        print(f"\n{col}:")
        print(counts)

describe_full(loan)
```
Numeric Statistics:
                 apr       nPaidOff       isFunded     loanAmount  \
count  573760.000000  577658.000000  577682.000000  575432.000000   
mean      553.080972       0.037887       0.067480     514.245084   
std       110.046159       0.333366       0.250852     320.939929   
min         0.000000       0.000000       0.000000       0.000000   
25%       490.000000       0.000000       0.000000     350.000000   
50%       590.000000       0.000000       0.000000     500.000000   
75%       601.000000       0.000000       0.000000     500.000000   
max       705.590000      21.000000       1.000000    5000.000000   

       originallyScheduledPaymentAmount       leadCost          hasCF  
count                     577682.000000  577682.000000  577682.000000  
mean                        1428.897209       7.854389       0.619187  
std                          925.009141      12.853451       0.485587  
min                         -816.710000       0.000000       0.000000  
25%                         1023.640000       3.000000       0.000000  
50%                         1245.250000       3.000000       1.000000  
75%                         1615.660000       6.000000       1.000000  
max                        19963.630000     200.000000       1.000000  

Categorical Counts:

loanId:
loanId
LL-I-07399092    1
LL-I-09120818    1
LL-I-12495837    1
LL-I-09315207    1
LL-I-07105524    1
                ..
LL-I-18326270    1
LL-I-14232965    1
LL-I-06403601    1
LL-I-10713224    1
LL-I-04733921    1
Name: count, Length: 577426, dtype: int64

anon_ssn:
anon_ssn
c8bb49de1f8ff99d2ecddfb7037dc66e    35
0b87684b60c8b8f5d0bd40eb5811cd50    20
f971898a3c70e4c3c316cb4bc48b6e2e    18
6f0e71bc16f634fafcf5337cecdfb25e    16
64cd38e9cd1e06ed480b5e607494ce57    15
                                    ..
9e28174874e0ddfaab268a6fda20f419     1
ba2f64a44344fcf8b5f0c4076bb6777f     1
ebe6ec6901aa8f9f59520ac4e65e0142     1
d3c1943189c453ce5149dc5c7c921f06     1
d7e55e85266208ac4c353f42ebcde5ca     1
Name: count, Length: 459393, dtype: int64

payFrequency:
payFrequency
B    316654
W    137188
M     59092
I     33787
S     29688
Name: count, dtype: int64

applicationDate:
applicationDate
2017-01-03T18:05:40.811000    3
2016-08-19T14:07:17.373000    2
2016-08-11T03:12:23.616000    2
2016-09-12T16:56:05.319000    2
2016-11-29T23:27:55.525000    2
                             ..
2016-08-30T23:15:03.936000    1
2015-10-16T07:06:51.886000    1
2016-08-03T00:56:19.972000    1
2016-06-22T17:48:34.161000    1
2015-11-17T22:04:20.862000    1
Name: count, Length: 577624, dtype: int64

originated:
originated
False    531676
True      46006
Name: count, dtype: int64

originatedDate:
originatedDate
2017-03-06T22:29:45.066000    2
2015-02-20T19:40:33.329000    2
2016-01-20T15:49:18.846000    1
2015-12-13T18:52:44.655000    1
2017-03-18T19:15:07.210000    1
                             ..
2017-03-03T18:00:09.560000    1
2017-03-14T13:45:08.746000    1
2016-07-03T20:12:54.603000    1
2016-01-11T17:12:16.285000    1
2015-11-12T22:35:47.644000    1
Name: count, Length: 46042, dtype: int64

approved:
approved
False    537646
True      40036
Name: count, dtype: int64

loanStatus:
loanStatus
Withdrawn Application          450984
Rejected                        85070
Paid Off Loan                   11427
External Collection             11334
New Loan                         8112
Internal Collection              5567
Returned Item                    1182
CSR Voided New Loan              1026
Settlement Paid Off               708
Credit Return Void                704
Customer Voided New Loan          504
Settled Bankruptcy                326
Pending Paid Off                  169
Charged Off Paid Off              160
Pending Application Fee             5
Pending Rescind                     4
Pending Application                 4
Voided New Loan                     2
Charged Off                         1
Customver Voided New Loan           1
Settlement Pending Paid Off         1
Name: count, dtype: int64

state:
state
OH         90496
IL         66710
TX         49729
MO         49249
WI         40349
MI         34635
TN         32991
NC         26962
FL         25896
IN         25781
SC         23925
CA         22691
NV         11477
PA          9723
VA          9197
NJ          7641
UT          6742
AL          4327
MS          3786
CO          3615
LA          3431
AZ          3355
NM          3231
KY          2911
SD          2212
MN          2078
GA          1763
OK          1758
WY          1715
CT          1468
WA          1384
KS          1227
IA          1043
DE           880
ID           722
RI           689
NE           635
HI           621
AK           290
ND           207
NY             3
MD             3
OH-TEST        1
TX-TEST        1
Name: count, dtype: int64

leadType:
leadType
bvMandatory      475001
lead              72673
organic           22851
prescreen          4515
rc_returning       2069
california          479
lionpay              26
repeat               24
instant-offer        22
express              22
Name: count, dtype: int64

fpStatus:
fpStatus
Checked        32978
No Payments    11427
Rejected        5522
No Schedule     1323
Cancelled        249
Skipped          218
Pending            5
Returned           1
Name: count, dtype: int64

clarityFraudId:
clarityFraudId
561e95f7e4b0efa8a6cdc975    15
5675c05de4b08b757e32e3b4    14
583d9d11e4b0bb9d66559063    13
581101f3e4b04db4de10a616    12
5666318be4b0100e879ac2f0    10
                            ..
579a6118e4b0d050242220df     1
577e4a29e4b06e5979673c41     1
57dad980e4b01a79d8b38492     1
5748eaf5e4b061d2b9c3f275     1
564f801ae4b05a02bef951e8     1
Name: count, Length: 314915, dtype: int64
show code
```{python}
# Creating a frequency table function
def freq(df, column):
   freq = df[column].value_counts()
   valid_perc = df[column].value_counts(normalize=True) * 100
   total_perc = df[column].value_counts(dropna=False, normalize=True) * 100
   
   table = pd.DataFrame({
       'Freq': freq,
       '% Valid': valid_perc,
       '% Valid Cum.': valid_perc.cumsum(),
       '% Total': total_perc,
       '% Total Cum.': total_perc.cumsum()
   }).round(2)
   
   # Add total row
   total_row = pd.DataFrame({
       'Freq': [len(df)],
       '% Valid': [100],
       '% Valid Cum.': [100],
       '% Total': [100],
       '% Total Cum.': [100]
   }, index=['Total'])
   print(f"--------------- `{column}` Frequency Table ---------------")
   return pd.concat([table, total_row])
freq(loan, 'approved')
```
--------------- `approved` Frequency Table ---------------
Freq % Valid % Valid Cum. % Total % Total Cum.
False 537646 93.07 93.07 93.07 93.07
True 40036 6.93 100.00 6.93 100.00
Total 577682 100.00 100.00 100.00 100.00
show code
```{python}
freq(loan, 'originated')
```
--------------- `originated` Frequency Table ---------------
Freq % Valid % Valid Cum. % Total % Total Cum.
False 531676 92.04 92.04 92.04 92.04
True 46006 7.96 100.00 7.96 100.00
Total 577682 100.00 100.00 100.00 100.00
show code
```{python}
freq(loan, 'loanStatus')
```
--------------- `loanStatus` Frequency Table ---------------
Freq % Valid % Valid Cum. % Total % Total Cum.
CSR Voided New Loan 1026.0 0.18 99.55 0.18 99.48
Charged Off 1.0 0.00 100.00 0.00 100.00
Charged Off Paid Off 160.0 0.03 100.00 0.03 100.00
Credit Return Void 704.0 0.12 99.80 0.12 99.73
Customer Voided New Loan 504.0 0.09 99.88 0.09 99.82
Customver Voided New Loan 1.0 0.00 100.00 0.00 100.00
External Collection 11334.0 1.96 96.80 1.96 96.73
Internal Collection 5567.0 0.96 99.17 0.96 99.10
New Loan 8112.0 1.41 98.20 1.40 98.14
Paid Off Loan 11427.0 1.98 94.84 1.98 94.77
Pending Application 4.0 0.00 100.00 0.00 100.00
Pending Application Fee 5.0 0.00 100.00 0.00 100.00
Pending Paid Off 169.0 0.03 99.97 0.03 99.97
Pending Rescind 4.0 0.00 100.00 0.00 100.00
Rejected 85070.0 14.74 92.86 14.73 92.79
Returned Item 1182.0 0.20 99.37 0.20 99.31
Settled Bankruptcy 326.0 0.06 99.94 0.06 99.94
Settlement Paid Off 708.0 0.12 99.67 0.12 99.61
Settlement Pending Paid Off 1.0 0.00 100.00 0.00 100.00
Voided New Loan 2.0 0.00 100.00 0.00 100.00
Withdrawn Application 450984.0 78.12 78.12 78.07 78.07
NaN NaN NaN NaN 0.07 99.88
Total 577682.0 100.00 100.00 100.00 100.00
show code
```{python}
freq(loan, 'leadType')
```
--------------- `leadType` Frequency Table ---------------
Freq % Valid % Valid Cum. % Total % Total Cum.
bvMandatory 475001 82.23 82.23 82.23 82.23
lead 72673 12.58 94.81 12.58 94.81
organic 22851 3.96 98.76 3.96 98.76
prescreen 4515 0.78 99.54 0.78 99.54
rc_returning 2069 0.36 99.90 0.36 99.90
california 479 0.08 99.98 0.08 99.98
lionpay 26 0.00 99.99 0.00 99.99
repeat 24 0.00 99.99 0.00 99.99
instant-offer 22 0.00 100.00 0.00 100.00
express 22 0.00 100.00 0.00 100.00
Total 577682 100.00 100.00 100.00 100.00
show code
```{python}
# nPaidOff counts previously paid-off loans, so its mean is an average
# count per application, not a percentage
nPaidOff_mean = round(loan['nPaidOff'].dropna().mean(), 3)
print(f"Average number of prior paid-off loans per application: {nPaidOff_mean}")
```
Average number of prior paid-off loans per application: 0.038

loan.csv points to a high-risk lending operation: only 40,036 of 577,682 applications were approved (a 6.93% approval rate), and just 7.96% of applications originated, despite a high median APR of 590%. High APRs, low approval and origination rates, and a low average count of prior paid-off loans all suggest a largely subprime borrower base with significant credit risk.
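
The funnel rates above come straight from the boolean columns; a minimal sketch with a hypothetical four-row sample (column names match loan.csv, values are made up, so the printed shares differ from the real 6.93%/7.96%):

```python
import pandas as pd

# toy boolean columns named like loan.csv's; real data has 577,682 rows
loan_toy = pd.DataFrame({'approved':   [True, False, False, False],
                         'originated': [True, True,  False, False]})

# mean of a boolean column is the share of True values
funnel_pct = loan_toy[['approved', 'originated']].mean().mul(100).round(2)
print(funnel_pct)
```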

show code
```{python}
loan_numeric = loan[['originallyScheduledPaymentAmount', 'loanAmount', 'apr']]
paymentAmount_max = loan_numeric['originallyScheduledPaymentAmount'].max()
loanAmount_max = loan_numeric['loanAmount'].dropna().max()

fig = go.Figure()

for col, color, name in [
        ('loanAmount', 'salmon', 'Loan Amount'),
        ('originallyScheduledPaymentAmount', 'turquoise', 'Originally Scheduled Payment Amount')
                        ]:
        fig.add_trace(go.Histogram(
                x=loan_numeric[col].dropna(),
                name=name,
                opacity=0.5,
                marker_color=color,
                nbinsx=30
))

for val, color, name in [
        (loanAmount_max, 'salmon', 'Max Loan'),
        (paymentAmount_max, 'turquoise', 'Max Payment')
]:
        fig.add_vline(
                x=val,
                line_dash="dash",
                line_color=color,
                annotation_text=f"${val:,.2f}",
                annotation_position="top"
        )

# layout only needs to be set once, outside the loop
fig.update_layout(
        barmode='overlay',
        xaxis_title='Dollars (USD)',
        showlegend=True,
        legend=dict(
                yanchor="top",
                y=0.99,
                xanchor="left",
                x=1.05
        ),
        margin=dict(r=150)
)

fig.show()
```

Assuming amounts are in USD (all the states listed are in the United States), originally scheduled payment totals are generally higher than loan amounts because they include interest and fees. The dashed lines mark each variable's maximum: the largest loan is $5,000, while the largest originally scheduled payment amount is $19,963.63, roughly a 4x multiplier from interest and fees.
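
That multiplier can be checked directly as a ratio; a sketch with invented rows (column names match loan.csv, the values are toy figures chosen near the observed ~4x):

```python
import pandas as pd

# toy rows mirroring the two loan.csv amount columns; values are made up
loan_toy = pd.DataFrame({
    'loanAmount': [500.0, 300.0, 1000.0],
    'originallyScheduledPaymentAmount': [2000.0, 1200.0, 3900.0],
})

# total scheduled repayment expressed as a multiple of the principal
multiplier = (loan_toy['originallyScheduledPaymentAmount']
              / loan_toy['loanAmount'])
print(multiplier.median())
```

On the real data, zero or missing loanAmount rows would need filtering first to avoid division by zero.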

show code
```{python}
freq(loan, 'payFrequency')
```
--------------- `payFrequency` Frequency Table ---------------
Freq % Valid % Valid Cum. % Total % Total Cum.
B 316654.0 54.94 54.94 54.81 54.81
I 33787.0 5.86 94.85 5.85 94.64
M 59092.0 10.25 88.99 10.23 88.79
S 29688.0 5.15 100.00 5.14 99.78
W 137188.0 23.80 78.74 23.75 78.56
NaN NaN NaN NaN 0.22 100.00
Total 577682.0 100.00 100.00 100.00 100.00

A majority of loans, 54.94% (316,654), are paid biweekly (B), while semi-monthly (S) is the least common frequency. However, this variable has missing data; let's investigate.

show code
```{python}
# This might be used later, hence i create this function 
def view_missing_frequency(df, column):
   missing_freq = df[df[column].isna()].head()
   print(missing_freq.to_string())
   
view_missing_frequency(loan, 'payFrequency')
```
             loanId                          anon_ssn payFrequency    apr             applicationDate  originated originatedDate  nPaidOff  approved  isFunded             loanStatus  loanAmount  originallyScheduledPaymentAmount state leadType  leadCost     fpStatus            clarityFraudId  hasCF
214   LL-I-08119905  8d94d36f11e4b203189e1694b52b6f61          NaN  590.0  2016-03-25T17:32:23.432000       False            NaN       0.0     False         0  Withdrawn Application       300.0                               0.0    OH  organic         0  No Payments  56f57622e4b030e93c770970      1
821   LL-I-08854991  dd5291dc42c18b5d97b6747cf5102fd4          NaN  590.0  2016-04-19T23:20:40.375000       False            NaN       0.0     False         0  Withdrawn Application       300.0                               0.0    IN  organic         0  No Payments  5716bd44e4b0cd57626777c6      1
1461  LL-I-11401749  9d2851e8d30add57fb6cc6722e0f25d7          NaN  590.0  2016-08-24T00:46:42.965000       False            NaN       0.0     False         0               Rejected       300.0                               0.0    TN  organic         0  No Payments  57bcee6ee4b033e6e6541c1a      1
1961  LL-I-11826560  40a58015111e3d9599975322aff494c1          NaN  590.0  2016-09-07T11:49:25.009000       False            NaN       0.0     False         0               Rejected       300.0                               0.0    CA  organic         0  No Payments  57ce6808e4b092499fe7209e      1
2136  LL-I-16240119  fad43a089c335acb76aa6b9aaf111a0a          NaN  645.0  2017-01-25T22:59:58.767000       False            NaN       0.0     False         0  Withdrawn Application       500.0                               0.0    KY  organic         0  No Payments  58548f8ee4b08f245ec11450      1

In these first five rows, `approved` is False. Let's list every unique `approved` value among the rows where `payFrequency` is missing.

show code
```{python}
unique_values = loan[loan['payFrequency'].isna()]['approved'].unique()
print(f"Unique `approved` values where payFrequency is missing: {unique_values}")
```
Unique `approved` values where payFrequency is missing: [False]

So every missing payFrequency value belongs to an unapproved loan.
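
A crosstab makes this kind of missingness check explicit; a minimal sketch with toy rows that reproduce the same pattern (column names match loan.csv, rows are invented):

```python
import pandas as pd

# toy frame: payFrequency is NaN only when approved is False
loan_toy = pd.DataFrame({'payFrequency': ['B', 'W', None, None],
                         'approved':     [True, False, False, False]})

# rows: is payFrequency missing? columns: approved value
xtab = pd.crosstab(loan_toy['payFrequency'].isna(), loan_toy['approved'])
print(xtab)
```

An empty (missing, approved=True) cell confirms the dependence between the two columns.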

Summary Statistics of payment.csv
show code
```{python}
describe_full(payment)
```
Numeric Statistics:
       installmentIndex      principal           fees  paymentAmount
count     689364.000000  689364.000000  689364.000000  689364.000000
mean          10.553222      45.557543      67.003994     112.680232
std            8.049530      81.724683      59.789510     105.783710
min            1.000000    -303.370000     -42.560000    -337.700000
25%            5.000000      13.180000      28.820000      56.810000
50%            9.000000      27.610000      51.300000      86.340000
75%           14.000000      53.380000      86.440000     135.090000
max          105.000000    4000.000000    1257.710000    4063.600000

Categorical Counts:

loanId:
loanId
LL-I-12230332    105
LL-I-08320897    105
LL-I-17454600    105
LL-I-09757835    105
LL-I-12217217    104
                ... 
LL-I-01714117      6
LL-I-15577134      6
LL-I-16759210      6
LL-I-01726796      6
LL-I-01138268      3
Name: count, Length: 39952, dtype: int64

isCollection:
isCollection
False    675469
True      13895
Name: count, dtype: int64

paymentDate:
paymentDate
2017-03-31T04:00:00           10236
2017-04-28T04:00:00            9336
2017-04-14T04:00:00            9138
2017-03-24T04:00:00            8942
2017-04-07T04:00:00            8899
                              ...  
2017-03-09T13:13:48.688000        1
2016-04-05T16:57:07.031000        1
2016-03-04T19:08:06.944000        1
2016-03-10T20:01:30.769000        1
2019-03-13T04:00:00               1
Name: count, Length: 18436, dtype: int64

paymentStatus:
paymentStatus
Cancelled                  270334
Checked                    209621
Rejected                    32330
Pending                      9241
Skipped                      3761
Rejected Awaiting Retry        18
Returned                        1
Complete                        1
Name: count, dtype: int64

paymentReturnCode:
paymentReturnCode
R01       22866
R02        2761
R08        2259
R16        1085
R10         620
MISSED      537
R03         318
RXS         226
R09         176
R07         160
C05         106
C01          87
R20          83
R99          60
RAF          58
R04          39
C03          34
C02          10
RWC           7
LPP01         7
R06           6
RUP           6
RBW           5
R29           4
RFG           3
R15           3
R13           2
C07           2
RIR           1
R19           1
RXL           1
Name: count, dtype: int64
show code
```{python}
freq(payment, 'isCollection')
```
--------------- `isCollection` Frequency Table ---------------
Freq % Valid % Valid Cum. % Total % Total Cum.
False 675469 97.98 97.98 97.98 97.98
True 13895 2.02 100.00 2.02 100.00
Total 689364 100.00 100.00 100.00 100.00
show code
```{python}
freq(payment,'paymentReturnCode')
```
--------------- `paymentReturnCode` Frequency Table ---------------
Freq % Valid % Valid Cum. % Total % Total Cum.
C01 87.0 0.28 98.95 0.01 99.95
C02 10.0 0.03 99.85 0.00 99.99
C03 34.0 0.11 99.82 0.00 99.99
C05 106.0 0.34 98.67 0.02 99.94
C07 2.0 0.01 99.99 0.00 100.00
LPP01 7.0 0.02 99.89 0.00 100.00
MISSED 537.0 1.70 95.54 0.08 99.80
R01 22866.0 72.51 72.51 3.32 98.74
R02 2761.0 8.76 81.27 0.40 99.14
R03 318.0 1.01 96.55 0.05 99.84
R04 39.0 0.12 99.71 0.01 99.99
R06 6.0 0.02 99.91 0.00 100.00
R07 160.0 0.51 98.34 0.02 99.92
R08 2259.0 7.16 88.43 0.33 99.47
R09 176.0 0.56 97.83 0.03 99.90
R10 620.0 1.97 93.84 0.09 99.72
R13 2.0 0.01 99.98 0.00 100.00
R15 3.0 0.01 99.98 0.00 100.00
R16 1085.0 3.44 91.88 0.16 99.63
R19 1.0 0.00 100.00 0.00 100.00
R20 83.0 0.26 99.21 0.01 99.96
R29 4.0 0.01 99.96 0.00 100.00
R99 60.0 0.19 99.40 0.01 99.97
RAF 58.0 0.18 99.58 0.01 99.98
RBW 5.0 0.02 99.95 0.00 100.00
RFG 3.0 0.01 99.97 0.00 100.00
RIR 1.0 0.00 99.99 0.00 100.00
RUP 6.0 0.02 99.93 0.00 100.00
RWC 7.0 0.02 99.87 0.00 99.99
RXL 1.0 0.00 100.00 0.00 100.00
RXS 226.0 0.72 97.27 0.03 99.88
NaN NaN NaN NaN 95.43 95.43
Total 689364.0 100.00 100.00 100.00 100.00
show code
```{python}
freq(payment,'paymentStatus')
```
--------------- `paymentStatus` Frequency Table ---------------
Freq % Valid % Valid Cum. % Total % Total Cum.
Cancelled 270334.0 51.46 51.46 39.21 39.21
Checked 209621.0 39.90 91.37 30.41 69.62
Complete 1.0 0.00 100.00 0.00 100.00
Pending 9241.0 1.76 99.28 1.34 99.45
Rejected 32330.0 6.15 97.52 4.69 98.11
Rejected Awaiting Retry 18.0 0.00 100.00 0.00 100.00
Returned 1.0 0.00 100.00 0.00 100.00
Skipped 3761.0 0.72 100.00 0.55 100.00
NaN NaN NaN NaN 23.80 93.42
Total 689364.0 100.00 100.00 100.00 100.00

It seems unusual that principal, fees, and paymentAmount all take negative values. These could stem from data entry errors, refunds or reversals, or calculation errors in the system.
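
Counting the affected rows is a one-liner; a sketch with invented rows (the negative row mimics a refund/reversal entry, values are made up):

```python
import pandas as pd

# toy payment rows; one row is fully negative like a reversal
pay_toy = pd.DataFrame({
    'principal':     [22.33, -303.37, 31.30],
    'fees':          [147.28, -42.56, 138.31],
    'paymentAmount': [169.61, -337.70, 169.61],
})

# count negative entries per amount column
neg_counts = (pay_toy[['principal', 'fees', 'paymentAmount']] < 0).sum()
print(neg_counts)
```

On the real file, comparing these counts against `isCollection` and `paymentStatus` would help distinguish reversals from data errors.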

show code
```{python}
fees_max = payment['fees'].max()
payamount_max = payment['paymentAmount'].max()
principal_max = payment['principal'].max()

fig = go.Figure()

for col, color in [('principal', 'lightblue'), 
                    ('paymentAmount', 'turquoise'),
                    ('fees', 'salmon')]:
    fig.add_trace(go.Histogram(
        x=payment[col],
        name=col,
        opacity=0.5,
        marker_color=color,
        nbinsx=30
    ))

# Add vertical lines
for val, color, name in [(fees_max, 'salmon', 'Max Fees'), 
                        (payamount_max, 'turquoise', 'Max Payment'),
                        (principal_max, 'lightblue', 'Max Principal')]:
    fig.add_vline(x=val, 
                line_dash="dash",
                line_color=color,
                annotation_text=f"${val:,.2f}",
                annotation_position="top")

fig.update_layout(
    barmode='overlay',
    title='Payment Distributions',
    xaxis_title='Amount',
    showlegend=True,
    legend=dict(
        yanchor="top",
        y=0.99,
        xanchor="left",
        x=1.05
    ),
    margin=dict(r=150)
)

fig.show()
```
show code
```{python}
def plot_installments(df):
   installments = df.groupby('loanId').size().reset_index(name='count')
   
   fig = go.Figure()
   
   # Add histogram
   fig.add_trace(go.Histogram(
       x=installments['count'],
       nbinsx=30,
       marker_color='lightblue'
   ))

   max_val = df['installmentIndex'].max()
   median_val = df['installmentIndex'].median()
   
   for val, text in [(max_val, 'Maximum'), (median_val, 'Median')]:
       fig.add_vline(
           x=val,
           line_dash="dash",
           annotation_text=f"{text}: {val:.0f}",
           annotation_position="top"
       )

   fig.update_layout(
       title='Distribution of Total Installments per Loan',
       xaxis_title='Number of Installments',
       showlegend=False
   )

   # Print summary statistics
   summary = df['installmentIndex'].describe()
   print("\nInstallment Index Summary:")
   print(summary)
   
   return fig

plot_installments(payment).show()
```

The maximum is 105 installments, but that is rare: the distribution is heavily right-skewed, with most loans having fewer than 25 installment payments, suggesting most borrowers are on shorter-term repayment schedules. The mean installment index is about 10.6, while the median is 9.

show code
```{python}
def plot_trends(df, min_installments=60):
   # Get loans with installments above threshold
   top_loans = (df.groupby('loanId')['installmentIndex']
               .max()
               .reset_index()
                .query(f'installmentIndex >= {min_installments}'))
   
   # Filter payments for those loans
   filtered_payments = df[df['loanId'].isin(top_loans['loanId'])]
   
   fig = px.line(filtered_payments, 
                 x='paymentDate', 
                 y='fees',
                 color='loanId',
                 title=f'Installments >= {min_installments}')
   
   fig.update_layout(showlegend=False)
   
   return fig

plot_trends(payment, 60)
```

The graph tracks fee patterns for loans with 60 or more installments. Fees generally decline over time from initial amounts of $50-150, with sharp spikes that likely represent late-payment penalties, collection charges, or loan restructuring events. Despite varying terms, most loans follow a similar declining pattern toward the end of their schedule.

2.1.2 Missing Values

show code
```{python}
# missing values for loan.csv
def missing_values(df):
   missingno.matrix(df)
   plt.title('Missing Values in Dataset')
   plt.show()
   
   missing_pct = df.isnull().sum() * 100 / len(df)
   return missing_pct.sort_values(ascending=False)

missing_values(loan)
```

originatedDate                      92.029525
fpStatus                            91.046458
clarityFraudId                      38.081332
apr                                  0.678920
loanAmount                           0.389488
payFrequency                         0.220363
loanStatus                           0.067684
loanId                               0.044315
state                                0.022850
nPaidOff                             0.004155
leadCost                             0.000000
leadType                             0.000000
isFunded                             0.000000
originallyScheduledPaymentAmount     0.000000
anon_ssn                             0.000000
approved                             0.000000
originated                           0.000000
applicationDate                      0.000000
hasCF                                0.000000
dtype: float64
show code
```{python}
missing_values(clarity)
```

.underwritingdataclarity.clearfraud.clearfraudidentityverification.phonetype                           96.954896
.underwritingdataclarity.clearfraud.clearfraudidentityverification.ssnnamereasoncodedescription        94.635392
.underwritingdataclarity.clearfraud.clearfraudidentityverification.ssnnamereasoncode                   94.635392
.underwritingdataclarity.clearfraud.clearfraudidentityverification.nameaddressreasoncodedescription    88.689902
.underwritingdataclarity.clearfraud.clearfraudidentityverification.nameaddressreasoncode               88.689902
.underwritingdataclarity.clearfraud.clearfraudidentityverification.ssndobreasoncode                    81.851986
.underwritingdataclarity.clearfraud.clearfraudindicator.driverlicenseinconsistentwithonfile            79.789757
.underwritingdataclarity.clearfraud.clearfraudindicator.workphonepreviouslylistedashomephone           56.954494
.underwritingdataclarity.clearfraud.clearfraudindicator.workphonepreviouslylistedascellphone           56.954494
.underwritingdataclarity.clearfraud.clearfraudindicator.driverlicenseformatinvalid                     10.148336
.underwritingdataclarity.clearfraud.clearfraudidentityverification.phonematchtype                       1.915501
.underwritingdataclarity.clearfraud.clearfraudidentityverification.phonematchtypedescription            1.915501
.underwritingdataclarity.clearfraud.clearfraudindicator.telephonenumberinconsistentwithstate            1.368789
clearfraudscore                                                                                         0.275366
.underwritingdataclarity.clearfraud.clearfraudidentityverification.phonematchresult                     0.080399
.underwritingdataclarity.clearfraud.clearfraudindicator.creditestablishedpriortossnissuedate            0.080399
.underwritingdataclarity.clearfraud.clearfraudindicator.creditestablishedbeforeage18                    0.080399
.underwritingdataclarity.clearfraud.clearfraudindicator.currentaddressreportedbynewtradeonly            0.080399
.underwritingdataclarity.clearfraud.clearfraudindicator.bestonfilessnrecordedasdeceased                 0.080399
.underwritingdataclarity.clearfraud.clearfraudindicator.highprobabilityssnbelongstoanother              0.080399
.underwritingdataclarity.clearfraud.clearfraudindicator.bestonfilessnissuedatecannotbeverified          0.080399
.underwritingdataclarity.clearfraud.clearfraudindicator.inquiryaddresshighrisk                          0.080399
.underwritingdataclarity.clearfraud.clearfraudindicator.inputssnrecordedasdeceased                      0.080399
.underwritingdataclarity.clearfraud.clearfraudindicator.inquirycurrentaddressnotonfile                  0.080399
.underwritingdataclarity.clearfraud.clearfraudindicator.morethan3inquiriesinthelast30days               0.080399
.underwritingdataclarity.clearfraud.clearfraudindicator.onfileaddresshighrisk                           0.080399
.underwritingdataclarity.clearfraud.clearfraudindicator.inquiryonfilecurrentaddressconflict             0.080399
.underwritingdataclarity.clearfraud.clearfraudindicator.telephonenumberinconsistentwithaddress          0.080399
.underwritingdataclarity.clearfraud.clearfraudindicator.inquiryageyoungerthanssnissuedate               0.080399
.underwritingdataclarity.clearfraud.clearfraudindicator.inquiryaddresscautious                          0.080399
.underwritingdataclarity.clearfraud.clearfraudindicator.inquiryaddressnonresidential                    0.080399
.underwritingdataclarity.clearfraud.clearfraudindicator.onfileaddresscautious                           0.080399
.underwritingdataclarity.clearfraud.clearfraudindicator.ssnreportedmorefrequentlyforanother             0.080399
.underwritingdataclarity.clearfraud.clearfraudindicator.currentaddressreportedbytradeopenlt90days       0.080399
.underwritingdataclarity.clearfraud.clearfraudindicator.inputssninvalid                                 0.080399
.underwritingdataclarity.clearfraud.clearfraudindicator.inputssnissuedatecannotbeverified               0.080399
.underwritingdataclarity.clearfraud.clearfraudindicator.onfileaddressnonresidential                     0.080399
.underwritingdataclarity.clearfraud.clearfraudidentityverification.overallmatchresult                   0.064319
.underwritingdataclarity.clearfraud.clearfraudidentityverification.ssnnamematch                         0.064319
.underwritingdataclarity.clearfraud.clearfraudidentityverification.nameaddressmatch                     0.064319
.underwritingdataclarity.clearfraud.clearfraudidentityverification.ssndobmatch                          0.064319
.underwritingdataclarity.clearfraud.clearfraudidentityverification.overallmatchreasoncode               0.064319
.underwritingdataclarity.clearfraud.clearfraudindicator.maxnumberofssnswithanybankaccount               0.034169
.underwritingdataclarity.clearfraud.clearfraudindicator.totalnumberoffraudindicators                    0.034169
.underwritingdataclarity.clearfraud.clearfraudinquiry.thirtydaysago                                     0.004020
.underwritingdataclarity.clearfraud.clearfraudinquiry.twentyfourhoursago                                0.004020
.underwritingdataclarity.clearfraud.clearfraudinquiry.threesixtyfivedaysago                             0.004020
.underwritingdataclarity.clearfraud.clearfraudinquiry.fifteendaysago                                    0.004020
.underwritingdataclarity.clearfraud.clearfraudinquiry.tenminutesago                                     0.004020
.underwritingdataclarity.clearfraud.clearfraudinquiry.sevendaysago                                      0.004020
.underwritingdataclarity.clearfraud.clearfraudinquiry.ninetydaysago                                     0.004020
.underwritingdataclarity.clearfraud.clearfraudinquiry.onehourago                                        0.004020
.underwritingdataclarity.clearfraud.clearfraudinquiry.oneminuteago                                      0.004020
underwritingid                                                                                          0.000000
dtype: float64
show code
```{python}
missing_values(payment)
```

paymentReturnCode    95.425784
paymentStatus        23.798313
loanId                0.000000
installmentIndex      0.000000
isCollection          0.000000
paymentDate           0.000000
principal             0.000000
fees                  0.000000
paymentAmount         0.000000
dtype: float64

Unfortunately, 4 columns in the dataset have more missing values than not. Before we explore further, we need to understand the pattern behind these NA values. For payment.csv, the majority of the missing values lie in paymentReturnCode.

show code
```{python}
def print_distinct(df,column):
    unique_codes = df[column].unique()
    chunks = [unique_codes[i:i+5] for i in range(0, len(unique_codes), 5)]

    print(f"-------------- {column} --------------")
    for chunk in chunks:
        print("|", " | ".join(str(code) for code in chunk),"|")
    
print_distinct(payment, 'paymentReturnCode')
```
-------------- paymentReturnCode --------------
| nan | R01 | R08 | R02 | R16 |
| RBW | R03 | R20 | R09 | R10 |
| R07 | C03 | C05 | R06 | R04 |
| R99 | C01 | RXS | C02 | RXL |
| R15 | RAF | RFG | RWC | RIR |
| MISSED | LPP01 | R13 | R29 | R19 |
| RUP | C07 |
show code
```{python}
print_distinct(payment,'paymentStatus')
```
-------------- paymentStatus --------------
| Checked | Rejected | Cancelled | nan | Skipped |
| Returned | Pending | Rejected Awaiting Retry | Complete |
show code
```{python}
def count_codes(payment):
   for status in payment['paymentStatus'].unique():
       count = payment[
           (payment['paymentStatus'] == status) & 
           (payment['paymentReturnCode'].notna())
       ].shape[0]
       print(f"{status}: {count}")


count_codes(payment)
```
Checked: 241
Rejected: 31290
Cancelled: 1
nan: 0
Skipped: 0
Returned: 1
Pending: 0
Rejected Awaiting Retry: 0
Complete: 0

Note that a payment return code only occurs when paymentStatus is Rejected, Checked, Cancelled, or Returned, in which case the code explains that outcome. This is a clear example of MAR (Missing At Random): we can fully explain why the data is missing based on paymentStatus. The same applies to originatedDate and fpStatus: originatedDate only exists if the loan has been originated, and fpStatus is NA when no ACH attempt has been made yet, according to the data dictionary.
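The MAR check above can be made explicit with a cross-tabulation of status against return-code presence. This is a minimal sketch on a hypothetical toy frame standing in for payment.csv; under MAR, the presence of a code should be fully explained by the status:

```python
import pandas as pd

# Toy stand-in for payment.csv (hypothetical rows, for illustration only)
payment_demo = pd.DataFrame({
    "paymentStatus": ["Checked", "Rejected", "Pending", "Complete", "Rejected"],
    "paymentReturnCode": ["R01", "R02", None, None, "R10"],
})

# Cross-tabulate status against whether a return code is present
tab = pd.crosstab(payment_demo["paymentStatus"],
                  payment_demo["paymentReturnCode"].notna())
print(tab)
```

On the real data, a column of zeros under `True` for statuses like Pending or Complete would confirm the MAR pattern described above.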

show code
```{python}
# Count unique loan IDs
unique_payment_loans = payment['loanId'].nunique()
unique_loan_loans = loan['loanId'].nunique()
n = 10

# Get the top n loan frequencies per file
payment_counts = (payment['loanId'].value_counts()
                .rename_axis('loanId')
                .reset_index(name='count')
                .head(n))

loan_counts = (loan['loanId'].value_counts()
                .rename_axis('loanId')
                .reset_index(name='count')
                .head(n))

print(f"Top {n} loan frequencies in payment data:\n{payment_counts}\n")
print(f"Top {n} loan frequencies in loan data:\n{loan_counts}\n")
print(f"Unique loanId in payment.csv: {unique_payment_loans:,} out of {len(payment):,}")
print(f"Unique loanId in loan.csv: {unique_loan_loans:,} out of {len(loan):,}")
```
Top 10 loan frequencies in payment data:
          loanId  count
0  LL-I-12230332    105
1  LL-I-08320897    105
2  LL-I-17454600    105
3  LL-I-09757835    105
4  LL-I-12217217    104
5  LL-I-12335887    104
6  LL-I-10000925    104
7  LL-I-10673366    104
8  LL-I-18166243    104
9  LL-I-10265149    104

Top 10 loan frequencies in loan data:
          loanId  count
0  LL-I-07399092      1
1  LL-I-09120818      1
2  LL-I-12495837      1
3  LL-I-09315207      1
4  LL-I-07105524      1
5  LL-I-08678272      1
6  LL-I-07517765      1
7  LL-I-12108022      1
8  LL-I-06703521      1
9  LL-I-18239150      1

Unique loanId in payment.csv: 39,952 out of 689,364
Unique loanId in loan.csv: 577,426 out of 577,682

payment and loan have different numbers of rows, and loanId repeats in payment, so each payment row does not represent a unique loan, let alone a unique customer. We cannot simply collapse these rows before modelling either, since installment frequency and other ordinal variables may be useful predictive features. After joining the datasets, we need to aggregate the rows so that each row represents one loanId; for this reason we plan to apply encoding methods later when fitting the model.
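The aggregation described above can be sketched with a groupby that collapses payment rows to one row per loanId while preserving installment information as features. This is a toy example with hypothetical values, not the real files:

```python
import pandas as pd

# Toy stand-in for the payment rows (hypothetical values)
payments_demo = pd.DataFrame({
    "loanId": ["LL-I-1", "LL-I-1", "LL-I-1", "LL-I-2"],
    "installmentIndex": [1, 2, 3, 1],
    "paymentAmount": [100.0, 100.0, 100.0, 250.0],
})

# One row per loanId: the installment count survives as a feature
# instead of being lost in a naive de-duplication
per_loan = payments_demo.groupby("loanId").agg(
    n_installments=("installmentIndex", "max"),
    total_paid=("paymentAmount", "sum"),
).reset_index()
print(per_loan)
```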

Additionally, it is quite strange that some loanId values in loan.csv follow a different format; these might not be loan IDs at all.

show code
```{python}
mask = loan['loanId'].notna() & (~loan['loanId'].astype(str).str.startswith('LL-I-'))
different_ids = loan[mask]

different_ids[['loanId', 'originated', 'loanStatus', 'approved']]

```
loanId originated loanStatus approved
12264 LL-T-00260783 True Paid Off Loan True
22497 54bf7e63e4b0606ee9e256f8 False Withdrawn Application False
34873 54b1f646e4b06aa735b27461 False Withdrawn Application False
54003 LL-T-03462630 True External Collection True
92019 LL-T-00317558 True External Collection True
113655 LP-I-00000217 False Withdrawn Application False
117783 LP-I-00000134 True Paid Off Loan True
139375 LP-I-00000145 True Paid Off Loan True
149793 LP-I-00000073 True Paid Off Loan True
161901 LP-I-00000075 True Paid Off Loan True
164992 LP-I-00000082 True Paid Off Loan True
166010 54c13be3e4b0606eea00d977 False Withdrawn Application False
176666 54c5ae5de4b0606eea3a7f79 False Withdrawn Application False
213279 LP-I-00000229 False Pending Application Fee False
217443 54c2911ae4b0606eea17ba44 False Withdrawn Application False
220659 54bf5316e4b0606ee9e22fd7 False Withdrawn Application False
224930 LP-I-00000158 False Pending Application Fee False
227052 LP-I-00000071 False Withdrawn Application False
228429 LP-I-00000059 False Withdrawn Application False
230554 54c5041fe4b0606eea303fe1 False Withdrawn Application False
250789 54bc36bee4b0ff4c40b4f958 False Withdrawn Application False
252437 54cf4f44e4b0ba763e583495 False Withdrawn Application False
253797 LP-I-00000127 True Paid Off Loan True
254789 LP-I-00000108 False Withdrawn Application False
257884 54bc5f98e4b0ff4c40b67e18 False Withdrawn Application False
293166 LP-I-00000142 True Customer Voided New Loan True
298085 LP-I-00000133 False Withdrawn Application False
299154 LP-I-00000193 False Withdrawn Application False
316927 LP-I-00000173 False Withdrawn Application False
318941 LP-I-00000200 False Pending Application Fee False
359016 54c22819e4b0606eea0e36e0 False Withdrawn Application False
389404 LP-I-00000141 False Withdrawn Application False
397049 54b71432e4b0fc49086daa02 False Withdrawn Application False
411752 LP-I-00000139 True Paid Off Loan True
427263 LP-I-00000189 False Withdrawn Application False
434562 LP-I-00000030 True Paid Off Loan True
446986 54c50424e4b0606eea30407e False Withdrawn Application False
480873 LP-I-00000029 False CSR Voided New Loan False
485533 LP-I-00000112 False Pending Application Fee False
493704 LL-T-01984747 True External Collection True
520443 LP-I-00000109 True Paid Off Loan True
566283 LP-I-00000061 True Paid Off Loan True
574360 LP-I-00000166 False Pending Application Fee False

These IDs belong to loans whose loanStatus is Paid Off, Withdrawn Application, Pending Application Fee, Voided, or External Collection.

2.1.3 Joining data

show code
```{python}
df_merged = (payment.merge(loan, on='loanId', how='left')
                  .merge(clarity, 
                        left_on='clarityFraudId', 
                        right_on='underwritingid', 
                        how='left'))
   
# Display first 5 rows and summary
pd.set_option('display.max_columns', None)
print("------------------ Joined Data Frame ------------------")
print("Joined dataset of loan.csv, payment.csv and clarity_underwriting_variables.csv")
print(f"Total Rows: {len(df_merged):,}, Total Columns: {len(df_merged.columns)}")

df_merged.head()
```
------------------ Joined Data Frame ------------------
Joined dataset of loan.csv, payment.csv and clarity_underwriting_variables.csv
Total Rows: 689,364, Total Columns: 81
loanId installmentIndex isCollection paymentDate principal fees paymentAmount paymentStatus paymentReturnCode anon_ssn payFrequency apr applicationDate originated originatedDate nPaidOff approved isFunded loanStatus loanAmount originallyScheduledPaymentAmount state leadType leadCost fpStatus clarityFraudId hasCF .underwritingdataclarity.clearfraud.clearfraudinquiry.thirtydaysago .underwritingdataclarity.clearfraud.clearfraudinquiry.twentyfourhoursago .underwritingdataclarity.clearfraud.clearfraudinquiry.oneminuteago .underwritingdataclarity.clearfraud.clearfraudinquiry.onehourago .underwritingdataclarity.clearfraud.clearfraudinquiry.ninetydaysago .underwritingdataclarity.clearfraud.clearfraudinquiry.sevendaysago .underwritingdataclarity.clearfraud.clearfraudinquiry.tenminutesago .underwritingdataclarity.clearfraud.clearfraudinquiry.fifteendaysago .underwritingdataclarity.clearfraud.clearfraudinquiry.threesixtyfivedaysago .underwritingdataclarity.clearfraud.clearfraudindicator.inquiryonfilecurrentaddressconflict .underwritingdataclarity.clearfraud.clearfraudindicator.totalnumberoffraudindicators .underwritingdataclarity.clearfraud.clearfraudindicator.telephonenumberinconsistentwithaddress .underwritingdataclarity.clearfraud.clearfraudindicator.inquiryageyoungerthanssnissuedate .underwritingdataclarity.clearfraud.clearfraudindicator.onfileaddresscautious .underwritingdataclarity.clearfraud.clearfraudindicator.inquiryaddressnonresidential .underwritingdataclarity.clearfraud.clearfraudindicator.onfileaddresshighrisk .underwritingdataclarity.clearfraud.clearfraudindicator.ssnreportedmorefrequentlyforanother .underwritingdataclarity.clearfraud.clearfraudindicator.currentaddressreportedbytradeopenlt90days .underwritingdataclarity.clearfraud.clearfraudindicator.inputssninvalid .underwritingdataclarity.clearfraud.clearfraudindicator.inputssnissuedatecannotbeverified .underwritingdataclarity.clearfraud.clearfraudindicator.inquiryaddresscautious 
.underwritingdataclarity.clearfraud.clearfraudindicator.morethan3inquiriesinthelast30days .underwritingdataclarity.clearfraud.clearfraudindicator.onfileaddressnonresidential .underwritingdataclarity.clearfraud.clearfraudindicator.creditestablishedpriortossnissuedate .underwritingdataclarity.clearfraud.clearfraudindicator.driverlicenseformatinvalid .underwritingdataclarity.clearfraud.clearfraudindicator.inputssnrecordedasdeceased .underwritingdataclarity.clearfraud.clearfraudindicator.inquiryaddresshighrisk .underwritingdataclarity.clearfraud.clearfraudindicator.inquirycurrentaddressnotonfile .underwritingdataclarity.clearfraud.clearfraudindicator.bestonfilessnissuedatecannotbeverified .underwritingdataclarity.clearfraud.clearfraudindicator.highprobabilityssnbelongstoanother .underwritingdataclarity.clearfraud.clearfraudindicator.maxnumberofssnswithanybankaccount .underwritingdataclarity.clearfraud.clearfraudindicator.bestonfilessnrecordedasdeceased .underwritingdataclarity.clearfraud.clearfraudindicator.currentaddressreportedbynewtradeonly .underwritingdataclarity.clearfraud.clearfraudindicator.creditestablishedbeforeage18 .underwritingdataclarity.clearfraud.clearfraudindicator.telephonenumberinconsistentwithstate .underwritingdataclarity.clearfraud.clearfraudindicator.driverlicenseinconsistentwithonfile .underwritingdataclarity.clearfraud.clearfraudindicator.workphonepreviouslylistedascellphone .underwritingdataclarity.clearfraud.clearfraudindicator.workphonepreviouslylistedashomephone .underwritingdataclarity.clearfraud.clearfraudidentityverification.ssnnamematch .underwritingdataclarity.clearfraud.clearfraudidentityverification.nameaddressmatch .underwritingdataclarity.clearfraud.clearfraudidentityverification.phonematchtype .underwritingdataclarity.clearfraud.clearfraudidentityverification.ssnnamereasoncodedescription .underwritingdataclarity.clearfraud.clearfraudidentityverification.phonematchresult 
.underwritingdataclarity.clearfraud.clearfraudidentityverification.nameaddressreasoncodedescription .underwritingdataclarity.clearfraud.clearfraudidentityverification.phonematchtypedescription .underwritingdataclarity.clearfraud.clearfraudidentityverification.overallmatchresult .underwritingdataclarity.clearfraud.clearfraudidentityverification.phonetype .underwritingdataclarity.clearfraud.clearfraudidentityverification.ssndobreasoncode .underwritingdataclarity.clearfraud.clearfraudidentityverification.ssnnamereasoncode .underwritingdataclarity.clearfraud.clearfraudidentityverification.nameaddressreasoncode .underwritingdataclarity.clearfraud.clearfraudidentityverification.ssndobmatch .underwritingdataclarity.clearfraud.clearfraudidentityverification.overallmatchreasoncode clearfraudscore underwritingid
0 LL-I-00000021 1 False 2014-12-19T05:00:00 22.33 147.28 169.61 Checked NaN 3e02899b53aa4877f9823b8ed5d5a6f1 B 478.67 2014-12-03T00:31:22.306000 True 2014-12-03T03:12:58.466000 NaN True 1 Settlement Paid Off 800.0 2035.32 OH lead 25 Checked NaN 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 LL-I-00000021 2 False 2015-01-02T05:00:00 26.44 143.17 169.61 Checked NaN 3e02899b53aa4877f9823b8ed5d5a6f1 B 478.67 2014-12-03T00:31:22.306000 True 2014-12-03T03:12:58.466000 NaN True 1 Settlement Paid Off 800.0 2035.32 OH lead 25 Checked NaN 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 LL-I-00000021 3 False 2015-01-16T05:00:00 31.30 138.31 169.61 Checked NaN 3e02899b53aa4877f9823b8ed5d5a6f1 B 478.67 2014-12-03T00:31:22.306000 True 2014-12-03T03:12:58.466000 NaN True 1 Settlement Paid Off 800.0 2035.32 OH lead 25 Checked NaN 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 LL-I-00000021 4 False 2015-01-30T05:00:00 37.07 132.54 169.61 Checked NaN 3e02899b53aa4877f9823b8ed5d5a6f1 B 478.67 2014-12-03T00:31:22.306000 True 2014-12-03T03:12:58.466000 NaN True 1 Settlement Paid Off 800.0 2035.32 OH lead 25 Checked NaN 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 LL-I-00000021 5 False 2015-02-13T05:00:00 43.89 125.72 169.61 Checked NaN 3e02899b53aa4877f9823b8ed5d5a6f1 B 478.67 2014-12-03T00:31:22.306000 True 2014-12-03T03:12:58.466000 NaN True 1 Settlement Paid Off 800.0 2035.32 OH lead 25 Checked NaN 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Before exploring this dataset further, we want to know whether there are any missing values, and to assess whether the join was successful.

show code
```{python}
def check_missing():
   # Calculate missing percentages
   missing_pct = (df_merged.isnull().mean() * 100).reset_index()
   missing_pct.columns = ['variable', 'percent_missing']
   
   # Calculate missing counts
   missing_counts = df_merged.isnull().sum().reset_index()
   missing_counts.columns = ['variable', 'n_missing']
   
   # Combine and calculate not missing
   missing_stats = missing_counts.merge(missing_pct, on='variable')
   missing_stats['not_missing'] = len(df_merged) - missing_stats['n_missing']
   
   # Filter and sort
   missing_stats = missing_stats[missing_stats['percent_missing'] > 0].sort_values(
       'percent_missing', ascending=False)
   
   missing_stats['percent_missing'] = missing_stats['percent_missing'].round(5)
   
   return missing_stats

check_missing()
```
variable n_missing percent_missing not_missing
73 .underwritingdataclarity.clearfraud.clearfraud... 672458 97.54759 16906
75 .underwritingdataclarity.clearfraud.clearfraud... 660177 95.76610 29187
68 .underwritingdataclarity.clearfraud.clearfraud... 660177 95.76610 29187
8 paymentReturnCode 657831 95.42578 31533
76 .underwritingdataclarity.clearfraud.clearfraud... 626264 90.84664 63100
70 .underwritingdataclarity.clearfraud.clearfraud... 626264 90.84664 63100
74 .underwritingdataclarity.clearfraud.clearfraud... 584855 84.83979 104509
62 .underwritingdataclarity.clearfraud.clearfraud... 579656 84.08562 109708
64 .underwritingdataclarity.clearfraud.clearfraud... 419967 60.92094 269397
63 .underwritingdataclarity.clearfraud.clearfraud... 419967 60.92094 269397
51 .underwritingdataclarity.clearfraud.clearfraud... 196965 28.57199 492399
7 paymentStatus 164057 23.79831 525307
67 .underwritingdataclarity.clearfraud.clearfraud... 141576 20.53719 547788
71 .underwritingdataclarity.clearfraud.clearfraud... 141576 20.53719 547788
61 .underwritingdataclarity.clearfraud.clearfraud... 138708 20.12116 550656
79 clearfraudscore 132637 19.24049 556727
48 .underwritingdataclarity.clearfraud.clearfraud... 131618 19.09267 557746
60 .underwritingdataclarity.clearfraud.clearfraud... 131618 19.09267 557746
59 .underwritingdataclarity.clearfraud.clearfraud... 131618 19.09267 557746
58 .underwritingdataclarity.clearfraud.clearfraud... 131618 19.09267 557746
56 .underwritingdataclarity.clearfraud.clearfraud... 131618 19.09267 557746
55 .underwritingdataclarity.clearfraud.clearfraud... 131618 19.09267 557746
54 .underwritingdataclarity.clearfraud.clearfraud... 131618 19.09267 557746
53 .underwritingdataclarity.clearfraud.clearfraud... 131618 19.09267 557746
52 .underwritingdataclarity.clearfraud.clearfraud... 131618 19.09267 557746
69 .underwritingdataclarity.clearfraud.clearfraud... 131618 19.09267 557746
49 .underwritingdataclarity.clearfraud.clearfraud... 131618 19.09267 557746
50 .underwritingdataclarity.clearfraud.clearfraud... 131618 19.09267 557746
47 .underwritingdataclarity.clearfraud.clearfraud... 131618 19.09267 557746
41 .underwritingdataclarity.clearfraud.clearfraud... 131618 19.09267 557746
36 .underwritingdataclarity.clearfraud.clearfraud... 131618 19.09267 557746
38 .underwritingdataclarity.clearfraud.clearfraud... 131618 19.09267 557746
39 .underwritingdataclarity.clearfraud.clearfraud... 131618 19.09267 557746
46 .underwritingdataclarity.clearfraud.clearfraud... 131618 19.09267 557746
40 .underwritingdataclarity.clearfraud.clearfraud... 131618 19.09267 557746
42 .underwritingdataclarity.clearfraud.clearfraud... 131618 19.09267 557746
43 .underwritingdataclarity.clearfraud.clearfraud... 131618 19.09267 557746
44 .underwritingdataclarity.clearfraud.clearfraud... 131618 19.09267 557746
45 .underwritingdataclarity.clearfraud.clearfraud... 131618 19.09267 557746
77 .underwritingdataclarity.clearfraud.clearfraud... 131477 19.07222 557887
72 .underwritingdataclarity.clearfraud.clearfraud... 131477 19.07222 557887
78 .underwritingdataclarity.clearfraud.clearfraud... 131477 19.07222 557887
65 .underwritingdataclarity.clearfraud.clearfraud... 131477 19.07222 557887
66 .underwritingdataclarity.clearfraud.clearfraud... 131477 19.07222 557887
37 .underwritingdataclarity.clearfraud.clearfraud... 131281 19.04379 558083
57 .underwritingdataclarity.clearfraud.clearfraud... 131281 19.04379 558083
27 .underwritingdataclarity.clearfraud.clearfraud... 131062 19.01202 558302
33 .underwritingdataclarity.clearfraud.clearfraud... 131062 19.01202 558302
28 .underwritingdataclarity.clearfraud.clearfraud... 131062 19.01202 558302
29 .underwritingdataclarity.clearfraud.clearfraud... 131062 19.01202 558302
30 .underwritingdataclarity.clearfraud.clearfraud... 131062 19.01202 558302
31 .underwritingdataclarity.clearfraud.clearfraud... 131062 19.01202 558302
34 .underwritingdataclarity.clearfraud.clearfraud... 131062 19.01202 558302
32 .underwritingdataclarity.clearfraud.clearfraud... 131062 19.01202 558302
35 .underwritingdataclarity.clearfraud.clearfraud... 131062 19.01202 558302
80 underwritingid 131038 19.00854 558326
25 clarityFraudId 115552 16.76212 573812
24 fpStatus 17706 2.56845 671658
15 nPaidOff 383 0.05556 688981
14 originatedDate 269 0.03902 689095

It seems the initial fraud-screening variables have 19-20% missing values, while the Clear Fraud identity-verification variables have missing rates of 95-97%, implying a selective verification process driven by risk-assessment needs. Because of this, we plan not to include all of these variables in the model, as the heavily missing ones would mostly add noise. The clarity underwriting variables we will prioritize are:

  • clearFraudScore (measure overall fraud risk)

  • totalnumberoffraudindicators (aggregate fraud signals)

  • clearfraudinquiry (inquiries of n-th times ago)

  • ssnnamematch (verify identity)

  • nameaddressmatch (address verification)

  • overallmatchresult (combined verification score)

  • creditestablishedbeforeage18 (potential fraud)

  • currentaddressreportedbynewtradeonly (suspicious signal)

  • morethan3inquiriesinthelast30days (shows desperation)

  • inputssninvalid (basic fraud check)

2.2 Determining the dependent variable.

show code
```{python}
print_distinct(payment,'paymentStatus')
```
-------------- paymentStatus --------------
| Checked | Rejected | Cancelled | nan | Skipped |
| Returned | Pending | Rejected Awaiting Retry | Complete |

After the exploratory data analysis, paymentStatus looks like the ideal dependent variable for predicting loan risk, because it captures the complete life cycle of loan performance. Although roughly 24% of rows lack a paymentStatus, the variable updates throughout the loan term, and its eight observed statuses trace loan progression from "Checked" to "Complete". It aligns well with the business context as a measure of both the applicant's behaviour and the loan's health.
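One way to turn paymentStatus into a supervised target is a binary good/bad mapping. The grouping below is an assumption for illustration only, not MoneyLion's definition; the `to_target` helper is hypothetical:

```python
import pandas as pd

# Hypothetical grouping of statuses into healthy vs. problematic payments
GOOD = {"Checked", "Complete", "Pending"}
BAD = {"Rejected", "Returned", "Cancelled", "Skipped", "Rejected Awaiting Retry"}

def to_target(status):
    """1 = problematic payment, 0 = healthy payment, None = unknown."""
    if status in BAD:
        return 1
    if status in GOOD:
        return 0
    return None  # missing or unseen statuses stay unlabeled

statuses = pd.Series(["Checked", "Rejected", "Complete", None])
labels = statuses.map(to_target)
print(labels)
```

Rows with a missing status stay unlabeled rather than being silently assigned to one class.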

3. Feature Engineering

Our goal here is to reduce dimensionality and improve predictive power. Realistically, we would implement a white-box model: we want a model that can still be interpreted rather than one that is merely highly accurate. We are willing to trade some predictive power for explainability, since financial regulation favours transparent fraud prediction for fair lending practices, regulatory compliance, model validation, and customer-explanation requirements.

However, for the purposes of this analysis we will explore both black-box and white-box models, as a comparison of how much predictive power we trade off for interpretability.

3.1 Selecting Useful Inputs

For the machine learning model to work, we want to remove any redundant variables that will not be useful for prediction. These include IDs, highly correlated variables (principal and fees versus paymentAmount), and dates. We also remove state, as it might introduce regional bias. We do, however, keep loanId so we can aggregate the rows later.

show code
```{python}
# Select initial features
selected_cols = [
    'loanId', 'paymentStatus', 'installmentIndex', 'isCollection',
    'principal', 'fees', 'paymentReturnCode', 'payFrequency', 'apr',
    'isFunded', 'approved', 'loanAmount', 'originallyScheduledPaymentAmount',
    'nPaidOff', 'originated', 'loanStatus', 'leadType', 'fpStatus',
    'clearfraudscore'
]

# Clarity columns -> underwriting variables that we believe to be useful
fraud_patterns = [
    'clearfraudinquiry', 'creditestablishedbeforeage18', 'ssnnamematch',
    'currentaddressreportedbynewtradeonly', 'morethan3inquiriesinthelast30days',
    'overallmatchresult', 'totalnumberoffraudindicators', 'inputssninvalid',
    'nameaddressmatch'
]

for pattern in fraud_patterns:
    pattern_cols = [col for col in df_merged.columns if pattern in col.lower()]
    selected_cols.extend(pattern_cols)

df_feature = df_merged[selected_cols].copy()

# Rename columns by taking last part after dot
rename_dict = {col: col.split('.')[-1] for col in df_feature.columns if '.' in col}
df_feature = df_feature.rename(columns=rename_dict)

print(f"Columns: {len(df_feature.columns)}, Rows: {len(df_feature):,}")
df_feature.head()
```
Columns: 36, Rows: 689,364
loanId paymentStatus installmentIndex isCollection principal fees paymentReturnCode payFrequency apr isFunded approved loanAmount originallyScheduledPaymentAmount nPaidOff originated loanStatus leadType fpStatus clearfraudscore thirtydaysago twentyfourhoursago oneminuteago onehourago ninetydaysago sevendaysago tenminutesago fifteendaysago threesixtyfivedaysago creditestablishedbeforeage18 ssnnamematch currentaddressreportedbynewtradeonly morethan3inquiriesinthelast30days overallmatchresult totalnumberoffraudindicators inputssninvalid nameaddressmatch
0 LL-I-00000021 Checked 1 False 22.33 147.28 NaN B 478.67 1 True 800.0 2035.32 NaN True Settlement Paid Off lead Checked NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 LL-I-00000021 Checked 2 False 26.44 143.17 NaN B 478.67 1 True 800.0 2035.32 NaN True Settlement Paid Off lead Checked NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 LL-I-00000021 Checked 3 False 31.30 138.31 NaN B 478.67 1 True 800.0 2035.32 NaN True Settlement Paid Off lead Checked NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 LL-I-00000021 Checked 4 False 37.07 132.54 NaN B 478.67 1 True 800.0 2035.32 NaN True Settlement Paid Off lead Checked NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 LL-I-00000021 Checked 5 False 43.89 125.72 NaN B 478.67 1 True 800.0 2035.32 NaN True Settlement Paid Off lead Checked NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
show code
```{python}
def plot_corr():
   df_num = df_feature.select_dtypes(include=['float64', 'int64']).dropna() 
   df_num.columns = [col[:10] for col in df_num.columns]
   corr = df_num.corr()
   
   plt.figure(figsize=(12, 10))
   sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
   plt.tight_layout()
   return plt.show()
plot_corr()
```

This correlation matrix presents challenges for the model. The high multicollinearity between the time-window variables means they should be aggregated to avoid redundancy; creating ratio features would probably be efficient. For now, to keep the model simple, we select just one: fifteendaysago. For the installment index, as discussed in the data preprocessing stage, we plan to apply an encoding method to aggregate the data.
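The ratio-feature idea mentioned above can be sketched as follows. This is a hypothetical example on a toy frame: keep one inquiry window and express recent activity relative to the yearly total, instead of carrying every correlated window into the model:

```python
import pandas as pd

# Toy stand-in for two of the correlated inquiry-window columns
df_demo = pd.DataFrame({
    "fifteendaysago": [1, 0, 3],
    "threesixtyfivedaysago": [5, 2, 10],
})

# Guard against division by zero by turning 0 denominators into NaN
denom = df_demo["threesixtyfivedaysago"].replace(0, float("nan"))
df_demo["recent_inquiry_ratio"] = df_demo["fifteendaysago"] / denom
print(df_demo)
```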

3.2 Missing Values by Removal

show code
```{python}
missing_values(df_feature)
```

paymentReturnCode                       95.425784
paymentStatus                           23.798313
clearfraudscore                         19.240488
inputssninvalid                         19.092671
morethan3inquiriesinthelast30days       19.092671
currentaddressreportedbynewtradeonly    19.092671
creditestablishedbeforeage18            19.092671
overallmatchresult                      19.072217
ssnnamematch                            19.072217
nameaddressmatch                        19.072217
totalnumberoffraudindicators            19.043785
ninetydaysago                           19.012017
twentyfourhoursago                      19.012017
fifteendaysago                          19.012017
tenminutesago                           19.012017
sevendaysago                            19.012017
onehourago                              19.012017
oneminuteago                            19.012017
thirtydaysago                           19.012017
threesixtyfivedaysago                   19.012017
fpStatus                                 2.568454
nPaidOff                                 0.055558
apr                                      0.000000
fees                                     0.000000
principal                                0.000000
payFrequency                             0.000000
isCollection                             0.000000
installmentIndex                         0.000000
approved                                 0.000000
loanAmount                               0.000000
isFunded                                 0.000000
leadType                                 0.000000
loanStatus                               0.000000
originated                               0.000000
originallyScheduledPaymentAmount         0.000000
loanId                                   0.000000
dtype: float64

If we want to drop rows, we need the missing data to be Missing Completely At Random (MCAR) so that deletion will not create selection bias; that is, the missingness must not correlate with important outcomes. paymentReturnCode has 95% missing values, so it is better to drop that column entirely. The fraud indicators are likely Missing At Random (MAR) or Missing Not At Random (MNAR), as the missingness visualization shows a consistent pattern. To double-check, since this data results from joining three tables, let us compare clarityFraudId with clearfraudscore on the joined dataset. nPaidOff, by contrast, has under 0.1% missing values, so dropping those rows barely affects dataset size: the remaining 99.9% provides sufficient information.

show code
```{python}
missing_values(df_merged[['clarityFraudId', 'clearfraudscore']])
```

clearfraudscore    19.240488
clarityFraudId     16.762117
dtype: float64

It seems that the missing values of clarityFraudId and clearfraudscore do not align, meaning the score is not missing simply because the clarity ID does not exist. We can therefore remove the fraud indicators where there is no clarityFraudId, while indicators with a present fraud ID can be imputed using K-Nearest Neighbors.
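The K-Nearest Neighbors imputation step mentioned above could look like the following. This is a minimal sketch on a toy numeric matrix (not the real features), assuming scikit-learn's `KNNImputer` is available; the values are illustrative only:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix: column 0 plays the role of clearfraudscore,
# column 1 a fraud indicator with one missing entry
X = np.array([
    [700.0, 1.0],
    [720.0, np.nan],   # missing value to be filled from its neighbours
    [710.0, 2.0],
    [400.0, 9.0],
])

# Fill the gap with the mean of the 2 nearest rows (by the observed column)
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

In practice the imputer would be fitted only on rows that have a clarityFraudId, matching the split described above.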

show code
```{python}
imputed = df_feature.copy()
imputed = imputed.drop(columns=['paymentReturnCode'])

# Filter out null values in key columns
df_impute = imputed[imputed['nPaidOff'].notna()]

missing_values(df_impute)
```

paymentStatus                           23.795141
clearfraudscore                         19.200529
inputssninvalid                         19.052630
morethan3inquiriesinthelast30days       19.052630
currentaddressreportedbynewtradeonly    19.052630
creditestablishedbeforeage18            19.052630
overallmatchresult                      19.032165
ssnnamematch                            19.032165
nameaddressmatch                        19.032165
totalnumberoffraudindicators            19.003717
ninetydaysago                           18.971931
thirtydaysago                           18.971931
tenminutesago                           18.971931
sevendaysago                            18.971931
onehourago                              18.971931
oneminuteago                            18.971931
twentyfourhoursago                      18.971931
fifteendaysago                          18.971931
threesixtyfivedaysago                   18.971931
fpStatus                                 2.569882
fees                                     0.000000
principal                                0.000000
apr                                      0.000000
isCollection                             0.000000
installmentIndex                         0.000000
payFrequency                             0.000000
approved                                 0.000000
loanAmount                               0.000000
isFunded                                 0.000000
leadType                                 0.000000
loanStatus                               0.000000
originated                               0.000000
nPaidOff                                 0.000000
originallyScheduledPaymentAmount         0.000000
loanId                                   0.000000
dtype: float64

Fortunately, at this point the remaining missing portion is small enough to delete outright: the rows that survive still carry sufficient information, and we avoid letting imputed values distort the model.

show code
```{python}
df_impute = df_impute.dropna()
columns_to_drop = [col for col in df_impute.columns if col.endswith('ago') and col != 'fifteendaysago']
df_impute = df_impute.drop(columns=columns_to_drop)
```
show code
```{python}
df_impute.head()
```
loanId paymentStatus installmentIndex isCollection principal fees payFrequency apr isFunded approved loanAmount originallyScheduledPaymentAmount nPaidOff originated loanStatus leadType fpStatus clearfraudscore fifteendaysago creditestablishedbeforeage18 ssnnamematch currentaddressreportedbynewtradeonly morethan3inquiriesinthelast30days overallmatchresult totalnumberoffraudindicators inputssninvalid nameaddressmatch
5957 LL-I-00202645 Checked 1 False 0.00 31.56 W 478.67 1 True 600.0 1463.49 0.0 True Paid Off Loan lead Checked 768.0 5.0 False match False False match 1.0 False match
5958 LL-I-00202645 Checked 2 False 8.40 55.23 W 478.67 1 True 600.0 1463.49 0.0 True Paid Off Loan lead Checked 768.0 5.0 False match False False match 1.0 False match
5959 LL-I-00202645 Checked 3 False 9.17 54.46 W 478.67 1 True 600.0 1463.49 0.0 True Paid Off Loan lead Checked 768.0 5.0 False match False False match 1.0 False match
5960 LL-I-00202645 Checked 4 False 572.41 0.00 W 478.67 1 True 600.0 1463.49 0.0 True Paid Off Loan lead Checked 768.0 5.0 False match False False match 1.0 False match
5961 LL-I-00202645 Cancelled 5 False 10.02 53.61 W 478.67 1 True 600.0 1463.49 0.0 True Paid Off Loan lead Checked 768.0 5.0 False match False False match 1.0 False match

Alright, now on to the encoding methods…

3.2 Encoding Methods

Now, before we continue, there are two ways this could work: either we retain all the long-format information from payment.csv, which keeps the same loanId present across every installmentIndex, or we derive the installments sequentially and build a weighted risk-level assessment. Let’s create the functions we need and then try both.

1/0 Binary Encoding

show code
```{python}
categorical = ['paymentStatus', 'payFrequency', 'loanStatus', 'leadType', 'fpStatus', 'ssnnamematch', 'overallmatchresult', 'nameaddressmatch']

binary = ['isCollection', 'approved', 'originated', 'creditestablishedbeforeage18',
                'currentaddressreportedbynewtradeonly', 'morethan3inquiriesinthelast30days',
                'inputssninvalid']

# One-hot encode binary columns
df_encoded = pd.get_dummies(df_impute, columns=binary, drop_first=False)

bool_columns = df_encoded.select_dtypes(include=['bool']).columns

# Convert boolean to 0/1 integers
for col in bool_columns:
    df_encoded[col] = df_encoded[col].astype(int)

df_encoded
```
loanId paymentStatus installmentIndex principal fees payFrequency apr isFunded loanAmount originallyScheduledPaymentAmount nPaidOff loanStatus leadType fpStatus clearfraudscore fifteendaysago ssnnamematch overallmatchresult totalnumberoffraudindicators nameaddressmatch isCollection_False isCollection_True approved_True originated_True creditestablishedbeforeage18_False creditestablishedbeforeage18_True currentaddressreportedbynewtradeonly_False currentaddressreportedbynewtradeonly_True morethan3inquiriesinthelast30days_False morethan3inquiriesinthelast30days_True inputssninvalid_False
5957 LL-I-00202645 Checked 1 0.00 31.56 W 478.67 1 600.0 1463.49 0.0 Paid Off Loan lead Checked 768.0 5.0 match match 1.0 match 1 0 1 1 1 0 1 0 1 0 1
5958 LL-I-00202645 Checked 2 8.40 55.23 W 478.67 1 600.0 1463.49 0.0 Paid Off Loan lead Checked 768.0 5.0 match match 1.0 match 1 0 1 1 1 0 1 0 1 0 1
5959 LL-I-00202645 Checked 3 9.17 54.46 W 478.67 1 600.0 1463.49 0.0 Paid Off Loan lead Checked 768.0 5.0 match match 1.0 match 1 0 1 1 1 0 1 0 1 0 1
5960 LL-I-00202645 Checked 4 572.41 0.00 W 478.67 1 600.0 1463.49 0.0 Paid Off Loan lead Checked 768.0 5.0 match match 1.0 match 1 0 1 1 1 0 1 0 1 0 1
5961 LL-I-00202645 Cancelled 5 10.02 53.61 W 478.67 1 600.0 1463.49 0.0 Paid Off Loan lead Checked 768.0 5.0 match match 1.0 match 1 0 1 1 1 0 1 0 1 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
685598 LL-I-18518246 Cancelled 9 100.06 177.66 S 645.00 1 1000.0 3332.75 0.0 Paid Off Loan lead Checked 694.0 6.0 match match 2.0 match 1 0 1 1 1 0 1 0 1 0 1
685599 LL-I-18518246 Cancelled 10 92.05 185.67 S 645.00 1 1000.0 3332.75 0.0 Paid Off Loan lead Checked 694.0 6.0 match match 2.0 match 1 0 1 1 1 0 1 0 1 0 1
685600 LL-I-18518246 Cancelled 11 138.29 139.43 S 645.00 1 1000.0 3332.75 0.0 Paid Off Loan lead Checked 694.0 6.0 match match 2.0 match 1 0 1 1 1 0 1 0 1 0 1
685601 LL-I-18518246 Cancelled 12 168.10 109.62 S 645.00 1 1000.0 3332.75 0.0 Paid Off Loan lead Checked 694.0 6.0 match match 2.0 match 1 0 1 1 1 0 1 0 1 0 1
685602 LL-I-18518246 Cancelled 13 219.62 58.21 S 645.00 1 1000.0 3332.75 0.0 Paid Off Loan lead Checked 694.0 6.0 match match 2.0 match 1 0 1 1 1 0 1 0 1 0 1

423402 rows × 31 columns

show code
```{python}
binary_list = []
for x in binary:
    binary_list.append(f"{x}_True")
    binary_list.append(f"{x}_False")
print(binary_list)
```
['isCollection_True', 'isCollection_False', 'approved_True', 'approved_False', 'originated_True', 'originated_False', 'creditestablishedbeforeage18_True', 'creditestablishedbeforeage18_False', 'currentaddressreportedbynewtradeonly_True', 'currentaddressreportedbynewtradeonly_False', 'morethan3inquiriesinthelast30days_True', 'morethan3inquiriesinthelast30days_False', 'inputssninvalid_True', 'inputssninvalid_False']

Label encoding

show code
```{python}
# Print your categorical columns
print("Categorical columns:", categorical)

# Print dataframe columns 
print("\nDataframe columns:", df_encoded.columns.tolist())
```
Categorical columns: ['paymentStatus', 'payFrequency', 'loanStatus', 'leadType', 'fpStatus', 'ssnnamematch', 'overallmatchresult', 'nameaddressmatch']

Dataframe columns: ['loanId', 'paymentStatus', 'installmentIndex', 'principal', 'fees', 'payFrequency', 'apr', 'isFunded', 'loanAmount', 'originallyScheduledPaymentAmount', 'nPaidOff', 'loanStatus', 'leadType', 'fpStatus', 'clearfraudscore', 'fifteendaysago', 'ssnnamematch', 'overallmatchresult', 'totalnumberoffraudindicators', 'nameaddressmatch', 'isCollection_False', 'isCollection_True', 'approved_True', 'originated_True', 'creditestablishedbeforeage18_False', 'creditestablishedbeforeage18_True', 'currentaddressreportedbynewtradeonly_False', 'currentaddressreportedbynewtradeonly_True', 'morethan3inquiriesinthelast30days_False', 'morethan3inquiriesinthelast30days_True', 'inputssninvalid_False']
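
One detail worth noting in the column list above: approved_False and inputssninvalid_True never appear, because pd.get_dummies only creates a column for each value actually observed in the data. A minimal illustration with a made-up single-valued column:

show code
```{python}
import pandas as pd

# Hypothetical single-valued column: every application is approved
df = pd.DataFrame({'approved': [True, True, True]})
cols = pd.get_dummies(df, columns=['approved']).columns.tolist()
print(cols)  # only the observed value gets a dummy column
```
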
show code
```{python}
def encode_categories(df):
    
    df_encoded = df.copy()

    encoders = {}
    
    for col in categorical:
        encoders[col] = LabelEncoder()
        df_encoded[col] = encoders[col].fit_transform(df[col].fillna('Missing'))
    
    return df_encoded, encoders


df_en, encoders = encode_categories(df_encoded)

for i in categorical[1:]:
    print_distinct(df_en, i)
```
-------------- payFrequency --------------
| 4 | 0 | 3 | 2 | 1 |
-------------- loanStatus --------------
| 5 | 2 | 9 | 1 | 6 |
| 8 | 3 | 4 | 7 | 0 |
-------------- leadType --------------
| 2 | 4 | 0 | 1 | 5 |
| 3 | 7 | 6 |
-------------- fpStatus --------------
| 1 | 2 | 0 | 3 |
-------------- ssnnamematch --------------
| 0 | 2 | 1 | 3 |
-------------- overallmatchresult --------------
| 0 | 3 | 2 | 1 |
-------------- nameaddressmatch --------------
| 1 | 0 | 2 | 3 | 4 |
show code
```{python}
for code, label in enumerate(encoders['paymentStatus'].classes_):
    print(f"{label}: {code}")
```
Cancelled: 0
Checked: 1
Pending: 2
Rejected: 3
Rejected Awaiting Retry: 4
Skipped: 5

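The same encoder objects can map integer codes back to labels via inverse_transform, which is handy later when interpreting model output. A standalone sketch with a toy encoder (not the notebook’s encoders dict):

show code
```{python}
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Classes are sorted alphabetically before codes are assigned
codes = le.fit_transform(['Checked', 'Cancelled', 'Pending', 'Checked'])
print(codes.tolist())
print(le.inverse_transform(codes).tolist())  # decode back to the labels
```
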
Great, it works! Our last concern now is how to aggregate the installment index and the behaviour during those installments. If we take the sequential approach and add features like installment_1, payment_1, principal_1, loans with fewer installments would end up with missing values. Before that, we need to modify this label encoder.

show code
```{python}
def encode_loan_categories(df):
    df_encoded = df.copy()
    encoders = {}

    risk_order = {
        'Checked': 0,
        'Pending': 1, 
        'Rejected Awaiting Retry': 2,
        'Skipped': 3,
        'Cancelled': 4,
        'Rejected': 5,
        'None': 6
    }
    df_encoded['paymentStatus'] = df['paymentStatus'].map(risk_order)

    categorical = df.select_dtypes(include=['object']).columns
    for col in [c for c in categorical if c != 'paymentStatus']:
        encoders[col] = LabelEncoder()
        df_encoded[col] = encoders[col].fit_transform(df[col].fillna('Missing'))
    
    return df_encoded, encoders

# Apply encoding
df_en, encoders = encode_loan_categories(df_encoded)
```

Sequential Wide data frame transformation

show code
```{python}
def sequential_feature(df, max_installments):
    df_sorted = df.sort_values(['loanId', 'installmentIndex'])
    features = {}
    
    
    defaults = {
        'paymentStatus': 6,
        'installmentIndex': 0,
        'principal': 0.0,
        'fees': 0.0,
        'isCollection': 0  
    }
    
    for loan_id, loan_data in df_sorted.groupby('loanId'):
        loan_features = {}
        
        for i in range(1, max_installments + 1):
            loan_features.update({
                f'payment{i}_status': defaults['paymentStatus'],
                f'payment{i}_index': defaults['installmentIndex'],
                f'payment{i}_principal': defaults['principal'],
                f'payment{i}_fees': defaults['fees'],
                f'payment{i}_isCollection': defaults['isCollection']
            })
        
        # Fill actual values
        for i, payment in enumerate(loan_data.itertuples(), 1):
            if i <= max_installments:
                loan_features[f'payment{i}_status'] = payment.paymentStatus
                loan_features[f'payment{i}_index'] = payment.installmentIndex
                loan_features[f'payment{i}_principal'] = payment.principal
                loan_features[f'payment{i}_fees'] = payment.fees
                # Check which column exists
                if hasattr(payment, 'isCollection'):
                    loan_features[f'payment{i}_isCollection'] = payment.isCollection
                elif hasattr(payment, 'isCollection_True'):
                    loan_features[f'payment{i}_isCollection'] = payment.isCollection_True
                elif hasattr(payment, 'isCollection_False'):
                    loan_features[f'payment{i}_isCollection'] = not payment.isCollection_False
        
        features[loan_id] = loan_features
    
    return pd.DataFrame.from_dict(features, orient='index')

df_wide = sequential_feature(df_en, max(df_en['installmentIndex']))
```
show code
```{python}
for x in range(10):
    print_distinct(df_wide, f'payment{x+1}_status')
```
-------------- payment1_status --------------
| 0 | 5 | 4 | 3 | 1 |
-------------- payment2_status --------------
| 0 | 5 | 4 | 3 | 6 |
| 1 |
-------------- payment3_status --------------
| 0 | 5 | 4 | 1 | 3 |
| 6 |
-------------- payment4_status --------------
| 0 | 4 | 5 | 3 | 6 |
| 2 | 1 |
-------------- payment5_status --------------
| 4 | 0 | 5 | 3 | 6 |
| 1 |
-------------- payment6_status --------------
| 4 | 0 | 5 | 6 | 3 |
| 2 | 1 |
-------------- payment7_status --------------
| 4 | 0 | 5 | 6 | 3 |
| 1 |
-------------- payment8_status --------------
| 4 | 0 | 3 | 5 | 6 |
| 2 | 1 |
-------------- payment9_status --------------
| 4 | 0 | 5 | 3 | 6 |
| 1 |
-------------- payment10_status --------------
| 4 | 0 | 3 | 5 | 6 |
| 2 | 1 |

Concurrent loanId long data frame

show code
```{python}
df_1, encoder_1= encode_categories(df_impute)
df_long = df_1.iloc[:, 1:]
df_long
```
paymentStatus installmentIndex isCollection principal fees payFrequency apr isFunded approved loanAmount originallyScheduledPaymentAmount nPaidOff originated loanStatus leadType fpStatus clearfraudscore fifteendaysago creditestablishedbeforeage18 ssnnamematch currentaddressreportedbynewtradeonly morethan3inquiriesinthelast30days overallmatchresult totalnumberoffraudindicators inputssninvalid nameaddressmatch
5957 1 1 False 0.00 31.56 4 478.67 1 True 600.0 1463.49 0.0 True 5 2 1 768.0 5.0 False 0 False False 0 1.0 False 1
5958 1 2 False 8.40 55.23 4 478.67 1 True 600.0 1463.49 0.0 True 5 2 1 768.0 5.0 False 0 False False 0 1.0 False 1
5959 1 3 False 9.17 54.46 4 478.67 1 True 600.0 1463.49 0.0 True 5 2 1 768.0 5.0 False 0 False False 0 1.0 False 1
5960 1 4 False 572.41 0.00 4 478.67 1 True 600.0 1463.49 0.0 True 5 2 1 768.0 5.0 False 0 False False 0 1.0 False 1
5961 0 5 False 10.02 53.61 4 478.67 1 True 600.0 1463.49 0.0 True 5 2 1 768.0 5.0 False 0 False False 0 1.0 False 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
685598 0 9 False 100.06 177.66 3 645.00 1 True 1000.0 3332.75 0.0 True 5 2 1 694.0 6.0 False 0 False False 0 2.0 False 1
685599 0 10 False 92.05 185.67 3 645.00 1 True 1000.0 3332.75 0.0 True 5 2 1 694.0 6.0 False 0 False False 0 2.0 False 1
685600 0 11 False 138.29 139.43 3 645.00 1 True 1000.0 3332.75 0.0 True 5 2 1 694.0 6.0 False 0 False False 0 2.0 False 1
685601 0 12 False 168.10 109.62 3 645.00 1 True 1000.0 3332.75 0.0 True 5 2 1 694.0 6.0 False 0 False False 0 2.0 False 1
685602 0 13 False 219.62 58.21 3 645.00 1 True 1000.0 3332.75 0.0 True 5 2 1 694.0 6.0 False 0 False False 0 2.0 False 1

423402 rows × 26 columns

4. Modelling

4.1 Story of two data

We’ve assembled two datasets for this model. To put it simply, one long and one wide:

  • long dataset df_long: the data stays in long format; each loanId repeats across rows, one row per installmentIndex
  • wide dataset df_wide: each row is a unique loanId; the installments are spread out as sequential columns along with supplementary information such as paymentStatus

df_long use case: fitted to machine learning models to predict payment status and interpret the results. df_wide use case: used to develop our own algorithm that categorizes borrowers by risk level, aligning with our business context.
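
To make the long/wide distinction concrete, here is a toy illustration with invented values (the notebook’s actual transformation is done by sequential_feature above, which carries more columns and default-fills missing installments):

show code
```{python}
import pandas as pd

# Invented long-format payments: loanId repeats, one row per installment
long_df = pd.DataFrame({
    'loanId': ['A', 'A', 'B'],
    'installmentIndex': [1, 2, 1],
    'principal': [10.0, 12.0, 7.0],
})

# Wide format: one row per loanId, one column per installment
wide_df = long_df.pivot(index='loanId', columns='installmentIndex', values='principal')
wide_df.columns = [f'payment{i}_principal' for i in wide_df.columns]
print(wide_df)  # loan B never reaches installment 2, so that cell is NaN
```
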

4.2 Built-in Algorithm

show code
```{python}
def weighted_risk_score(df):
    status_columns = [col for col in df.columns if 'status' in col]
    
    # Define risk weights that increase with severity
    risk_weights = {
        0: 0.0,   # Completed - Lowest risk 
        1: 0.2,     
        2: 0.5,     
        3: 0.7,    
        4: 0.9,     
        5: 1.0      # Rejected - Maximum risk
    }
    
    def time_weight(payment_number):
        return 1 / (1 + 0.1 * payment_number)  
    

    def calculate_loan_risk(row):
        total_weight = 0
        risk_score = 0
        
        for i, col in enumerate(status_columns):
            status_value = row[col]
            risk_value = risk_weights.get(status_value, 0)
            time_value = time_weight(i)
            
            risk_score += risk_value * time_value
            total_weight += time_value
            
        return (risk_score / total_weight) if total_weight > 0 else 0
    df['risk_score'] = df.apply(calculate_loan_risk, axis=1)
    # Categorize risk levels
    df['risk_category'] = pd.cut(df['risk_score'], 
                                bins=[0, 0.2, 0.4, 0.6, 0.8, 1.0],
                                labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])
    
    return df

df_result = weighted_risk_score(df_wide)
```
show code
```{python}
loan_ids = pd.DataFrame(df_impute['loanId'].unique(), columns=['loanId'])

# df_result is indexed by loanId; reset it so the two frames align by position
df_risk = pd.concat([loan_ids, df_result[['risk_score', 'risk_category']].reset_index(drop=True)], axis=1)

df_risk
```
loanId risk_score risk_category
0 LL-I-00202645 0.339224 Low
1 LL-I-00202774 0.266060 Low
2 LL-I-00204105 0.000000 NaN
3 LL-I-00204517 0.053147 Very Low
4 LL-I-00206077 0.242621 Low
... ... ... ...
30493 LL-I-18368434 0.008036 Very Low
30494 LL-I-18397547 0.008036 Very Low
30495 LL-I-18403128 0.241839 Low
30496 LL-I-18441013 0.283532 Low
30497 LL-I-18518246 0.066197 Very Low

30498 rows × 3 columns

Alright, this is just a risk-weighting algorithm I came up with by intuition. Therefore, no machine learning model is used.

The algorithm uses two main components:

  1. Risk Weights: We assign increasing weights from 0 to 1 based on payment status severity. It’s like a grading scale where:
  • 0.0 (Checked) = A+ borrower, always pays on time
  • 0.2 (Pending) = B student, mostly reliable
  • 0.5 (Rejected Awaiting Retry) = C borrower, showing concerning signs
  • 0.7 (Skipped) = D borrower, significant issues -> probably financial or personal problems
  • 0.9 (Cancelled) = F borrower, major red flags
  • 1.0 (Rejected) = complete failure to meet requirements
  2. Time Decay Function: We use a hyperbolic decay formula: 1/(1 + 0.1 * payment_number). Because the weight shrinks as payment_number grows, the earliest installments carry the most weight - the idea being that a borrower’s first payments are especially predictive of later behavior.

The final risk score combines these factors: risk_score = Σ(risk_weight * time_weight) / Σ(time_weights)

I divide by the total weights to normalize the score, ensuring it ranges between 0 and 1, which makes it easy to interpret and bin into risk levels. This draws loose inspiration from FICO scores, which also weight payment history unevenly across time.
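
The formula can be checked by hand. Below is a worked example for a hypothetical three-payment loan with statuses Checked (0), Skipped (3) and Rejected (5) under the risk_order codes, using the same risk_weights and time-decay definitions as above:

show code
```{python}
risk_weights = {0: 0.0, 1: 0.2, 2: 0.5, 3: 0.7, 4: 0.9, 5: 1.0}
statuses = [0, 3, 5]  # Checked, Skipped, Rejected

# risk_score = sum(risk_weight * time_weight) / sum(time_weights)
num = sum(risk_weights[s] * (1 / (1 + 0.1 * i)) for i, s in enumerate(statuses))
den = sum(1 / (1 + 0.1 * i) for i in range(len(statuses)))
score = num / den
print(round(score, 4))  # 0.5359 -> falls in the 'Medium' bin (0.4, 0.6]
```
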

For example, loanId LL-I-00202645 has a risk score of 0.339224, which we categorize as Low. There is one issue: some risk categories come out as NaN. How to handle these is up for discussion with other data scientists, so we can develop our own model based on the relevant business context and situation.
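
The NaN category traces back to pd.cut: with the default right=True, the first bin (0, 0.2] excludes a score of exactly 0, so perfectly clean loans fall outside every bin. Passing include_lowest=True is one possible fix, sketched here on toy scores:

show code
```{python}
import pandas as pd

scores = pd.Series([0.0, 0.05, 0.95])
bins = [0, 0.2, 0.4, 0.6, 0.8, 1.0]
labels = ['Very Low', 'Low', 'Medium', 'High', 'Very High']

default = pd.cut(scores, bins=bins, labels=labels)                     # (0, 0.2] drops 0.0
fixed = pd.cut(scores, bins=bins, labels=labels, include_lowest=True)  # [0, 0.2] keeps it
print(default.tolist())  # first entry is NaN
print(fixed.tolist())
```
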

4.3 Traditional Machine Learning model

4.3.1 Data Partitioning

show code
```{python}
X = df_long.drop('paymentStatus', axis=1)
y = df_long['paymentStatus']
print(f"Splitting data...")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
Splitting data...

4.3.2 Model Building

show code
```{python}
class_weights = {0: 1, 1: 1, 2: 5, 3: 5, 4: 10, 5: 10, 6: 15}
models = {
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'Logistic Regression': LogisticRegression(class_weight=class_weights, max_iter=1000),
'Decision Tree': DecisionTreeClassifier(random_state=42),
'KNN': KNeighborsClassifier(n_neighbors=5),
'Naive Bayes': GaussianNB()
}
results = {}
def machine_learning(X_train_scaled, y_train, model):
    # Note: relies on X_test_scaled, y_test and the results dict defined
    # in the surrounding notebook scope
    print("fitting models...\n")

    print(f"Stage 1: fitting {model}")
    start = time.time()
    model.fit(X_train_scaled, y_train)

    print(f"Stage 2: predicting {model}")
    y_pred = model.predict(X_test_scaled)

    print(f"Stage 3: collecting accuracy {model}")
    accuracy = accuracy_score(y_test, y_pred)
    cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5)


    results[model] = {
    'Accuracy': accuracy,
    'CV Mean': cv_scores.mean(),
    'CV Std': cv_scores.std(),
    'Model': model,
    'Classification Report': classification_report(y_test, y_pred)}

    print(f"Test Accuracy: {results[model]['Accuracy']:.4f}")
    print(f"CV Mean Accuracy: {results[model]['CV Mean']:.4f} (+/- {results[model]['CV Std']*2:.4f})")
    print("\nClassification Report:")
    print(results[model]['Classification Report'])
    end = time.time()
    print(f"{model} took {(end - start) / 60:.2f} minutes\n")
    print("---------------------------------------------\n")

    return results, scaler
```

4.3.3 Random Forest

show code
```{python}
results_rf, scaler_rf = machine_learning(X_train_scaled,y_train, model=models['Random Forest'])
```
fitting models...

Stage 1: fitting RandomForestClassifier(random_state=42)
Stage 2: predicting RandomForestClassifier(random_state=42)
Stage 3: collecting accuracy RandomForestClassifier(random_state=42)
Test Accuracy: 0.8778
CV Mean Accuracy: 0.8705 (+/- 0.0015)

Classification Report:
              precision    recall  f1-score   support

           0       0.92      0.94      0.93     43173
           1       0.86      0.93      0.89     34135
           2       0.63      0.20      0.30      1464
           3       0.61      0.31      0.41      5350
           4       1.00      0.50      0.67         2
           5       0.32      0.02      0.04       557

    accuracy                           0.88     84681
   macro avg       0.72      0.48      0.54     84681
weighted avg       0.86      0.88      0.86     84681

RandomForestClassifier(random_state=42) took 1704941585.7809746 minutes

---------------------------------------------

Random Forest results show 87.78% test accuracy with stable cross-validation (87.05% ± 0.15%). Strong performance for common statuses (Checked/Pending), weak for rare cases (Rejected/Skipped). Class imbalance evident from support numbers.

4.3.4 Logistic Regression

show code
```{python}
results_lr, scaler_lr = machine_learning(X_train_scaled,y_train, models['Logistic Regression'])
```
fitting models...

Stage 1: fitting LogisticRegression(class_weight={0: 1, 1: 1, 2: 5, 3: 5, 4: 10, 5: 10, 6: 15},
                   max_iter=1000)
Stage 2: predicting LogisticRegression(class_weight={0: 1, 1: 1, 2: 5, 3: 5, 4: 10, 5: 10, 6: 15},
                   max_iter=1000)
Stage 3: collecting accuracy LogisticRegression(class_weight={0: 1, 1: 1, 2: 5, 3: 5, 4: 10, 5: 10, 6: 15},
                   max_iter=1000)
Test Accuracy: 0.6727
CV Mean Accuracy: 0.6718 (+/- 0.0052)

Classification Report:
              precision    recall  f1-score   support

           0       0.80      0.73      0.77     43173
           1       0.71      0.65      0.67     34135
           2       0.00      0.00      0.00      1464
           3       0.24      0.63      0.35      5350
           4       0.00      0.00      0.00         2
           5       0.12      0.00      0.00       557

    accuracy                           0.67     84681
   macro avg       0.31      0.33      0.30     84681
weighted avg       0.71      0.67      0.68     84681

LogisticRegression(class_weight={0: 1, 1: 1, 2: 5, 3: 5, 4: 10, 5: 10, 6: 15},
                   max_iter=1000) took 1704941611.1439714 minutes

---------------------------------------------

Logistic Regression achieves 67.27% accuracy (well below Random Forest), with good detection of low-risk loans but critical failures on high-risk categories despite class weighting. Poor recall on risky loans makes it unsuitable for loan assessment.

4.3.5 Decision Tree

show code
```{python}
results_dt, scaler_dt = machine_learning(X_train_scaled,y_train, models['Decision Tree'])
```
fitting models...

Stage 1: fitting DecisionTreeClassifier(random_state=42)
Stage 2: predicting DecisionTreeClassifier(random_state=42)
Stage 3: collecting accuracy DecisionTreeClassifier(random_state=42)
Test Accuracy: 0.7997
CV Mean Accuracy: 0.7918 (+/- 0.0042)

Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.87      0.88     43173
           1       0.81      0.81      0.81     34135
           2       0.30      0.31      0.31      1464
           3       0.34      0.36      0.35      5350
           4       0.17      0.50      0.25         2
           5       0.05      0.07      0.06       557

    accuracy                           0.80     84681
   macro avg       0.43      0.49      0.44     84681
weighted avg       0.80      0.80      0.80     84681

DecisionTreeClassifier(random_state=42) took 1704941625.8401513 minutes

---------------------------------------------

The Decision Tree classifier achieves 80% accuracy, performing well on common payment statuses but struggling with rare cases. Its cross-validation stability at 79.18% demonstrates consistent but imperfect predictive ability.

4.3.6 K Nearest Neighbor (KNN)

show code
```{python}
results_knn, scaler_knn = machine_learning(X_train_scaled, y_train, models['KNN'])
```
fitting models...

Stage 1: fitting KNeighborsClassifier()
Stage 2: predicting KNeighborsClassifier()
Stage 3: collecting accuracy KNeighborsClassifier()
Test Accuracy: 0.8515
CV Mean Accuracy: 0.8362 (+/- 0.0020)

Classification Report:
              precision    recall  f1-score   support

           0       0.86      0.96      0.91     43173
           1       0.86      0.89      0.88     34135
           2       0.32      0.03      0.05      1464
           3       0.30      0.06      0.10      5350
           4       0.00      0.00      0.00         2
           5       0.07      0.00      0.00       557

    accuracy                           0.85     84681
   macro avg       0.40      0.32      0.32     84681
weighted avg       0.81      0.85      0.82     84681

KNeighborsClassifier() took 1704941717.6601782 minutes

---------------------------------------------

85.15% accuracy, slightly lower than Random Forest (87.78%) but better than Logistic Regression (67.27%) and Decision Tree (79.97%). Strong with common payment statuses but near-zero performance on rare cases. Cross-validation shows stable performance at 83.62%.

4.3.7 Naive Bayes

show code
```{python}
results_nb, scaler_nb = machine_learning(X_train_scaled, y_train, models['Naive Bayes'])
```
fitting models...

Stage 1: fitting GaussianNB()
Stage 2: predicting GaussianNB()
Stage 3: collecting accuracy GaussianNB()
Test Accuracy: 0.1092
CV Mean Accuracy: 0.1119 (+/- 0.0087)

Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.17      0.29     43173
           1       0.61      0.03      0.07     34135
           2       0.04      0.00      0.00      1464
           3       0.12      0.01      0.02      5350
           4       0.00      1.00      0.00         2
           5       0.01      0.97      0.02       557

    accuracy                           0.11     84681
   macro avg       0.27      0.37      0.06     84681
weighted avg       0.68      0.11      0.17     84681

GaussianNB() took 1704941716.9999843 minutes

---------------------------------------------

Naive Bayes performs extremely poorly at 10.92% accuracy - significantly worse than all other models (RF: 87.78%, KNN: 85.15%, DT: 79.97%, LR: 67.27%). It shows unusual behavior, catching rare cases while missing common ones. NOT SUITABLE for this payment status prediction task.

4.3.8 Champion model and comparison

Random Forest emerges as the champion model with 87.78% accuracy, demonstrating superior performance across payment status predictions. Its strength lies in handling both common statuses (Checked/Pending: >90% precision) and maintaining moderate performance on rare cases.

Model comparison by accuracy:

  1. Random Forest: 87.78%
  2. KNN: 85.15%
  3. Decision Tree: 79.97%
  4. Logistic Regression: 67.27%
  5. Naive Bayes: 10.92%

While Random Forest excels in accuracy, KNN performs well but struggles with rare cases. Logistic Regression handles only majority classes effectively. Naive Bayes proves unsuitable for this payment prediction task.

All models face challenges with class imbalance, particularly for statuses like Rejected and Skipped. For production implementation, Random Forest’s balance of accuracy and stability makes it the recommended choice for prediction; for interpretation, however, the decision tree is best.

4.3.9 Interpreting decision tree

show code
```{python}
from sklearn.inspection import partial_dependence, PartialDependenceDisplay
import matplotlib.pyplot as plt

features_to_plot = [0, 1, 2, 3,4,5]  
feature_names = ['installmentIndex', 'isCollection', 'fees',
                 'fpStatus', 'clearfraudscore','payFrequency' ]

display = PartialDependenceDisplay.from_estimator(
  models['Decision Tree'].fit(X_train_scaled, y_train),
  X_train_scaled,
  features_to_plot,
  target=0,  # Add target class
  feature_names=feature_names
)
plt.tight_layout()
plt.show()
```

4.3.10 Decision Tree interpretation

According to the decision tree model, the partial dependence plots reveal that installment index is the strongest predictor: loan risk rises significantly over the early payments before stabilizing. The fraud score has a gradual negative impact, suggesting higher scores correlate with lower risk. Surprisingly, fees is the most stable feature, and collection status shows minimal influence. Additionally, first payment status maintains a consistent impact across categories.

Interesting note is that pay frequency’s spikes might indicate that borrowers who choose unusual payment schedules could be facing underlying financial challenges. It’s like how someone asking for non-standard payment arrangements might be showing signs of financial stress.

5. Model Limitation

As expected, the Random Forest, Decision Tree and KNN models predict far better than Logistic Regression and Naive Bayes.

Limitation:

  • This is a simple baseline before we scale up; the model will eventually become more sophisticated.
  • All models struggle with minority classes.
  • The data is heavily imbalanced, which likely explains the poor performance on minority classes: algorithms naturally bias toward majority classes to maximize overall accuracy.
  • The models are not ready for production deployment; hyperparameter tuning is still needed.

In our loan payment analysis, with common statuses like ‘Checked’ having nearly 40 times more examples than ‘Skipped’ or ‘Rejected’, models couldn’t learn reliable patterns for rare cases. This imbalance particularly affected simpler algorithms like Logistic Regression, which failed completely on minority classes. Even the more sophisticated Random Forest, while performing better overall, still showed significant weakness in predicting these critical but uncommon payment statuses.
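
This kind of imbalance is easy to quantify before modelling; a sketch with toy labels (running value_counts on the real y_train would give the actual ratios):

show code
```{python}
import pandas as pd

# Toy labels mimicking the imbalance described above (not the real y_train)
y = pd.Series(['Checked'] * 400 + ['Pending'] * 300 + ['Skipped'] * 10)
ratios = y.value_counts(normalize=True)
print(ratios.round(3).to_dict())  # 'Skipped' is a tiny fraction of the data
```
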

Potential Solutions:

  • Oversampling methods
  • Class weights
  • Collecting more data for rare cases
  • Better feature engineering
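
Of these, class weights are the cheapest to try: scikit-learn can derive balanced weights directly from the label distribution, so rare classes count more during training. A sketch with toy labels (not the actual y_train):

show code
```{python}
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced toy labels: class 0 dominates, class 2 is rare
y = np.array([0] * 90 + [1] * 8 + [2] * 2)
weights = compute_class_weight(class_weight='balanced', classes=np.unique(y), y=y)
w = dict(zip(np.unique(y).tolist(), weights.round(2).tolist()))
print(w)  # rare classes receive proportionally larger weights
```

A dict like this can be passed straight to the class_weight parameter of most sklearn classifiers, instead of hand-picking weights as in section 4.3.2.
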

Fitting issues:

Random Forest (87.78% test, 87.05% CV): - Slight overfitting, but it’s not an issue - Complex model capturing noise

Logistic Regression (67.27% test, 67.18% CV): - Balanced but scored very low - Too simple for complex patterns

KNN (85.15% test, 83.62% CV): - Moderate overfitting - Sensitive to training data

Decision Tree (79.97% test, 79.18% CV): - Slight overfitting, but it’s not an issue - Missing some key patterns

All models particularly struggle with minority classes.